Preface


THE purposes stated for this book in the original edition have also guided
its revision. The basic course in testing should present the principles of
testing in such a way that the student will learn to choose tests wisely for
particular needs, and will be aware of the potentialities and limitations of
the tests he chooses. We now have a large number of general principles of
testing to aid in such evaluation and interpretation.

Psychological testing has been advanced chiefly by two lines of work: 
one, the practical and clinical application of instruments; the other, the 
theoretical and mathematical analysis of testing problems. These two lines 
of thought have often remained independent, so that test interpretations 
employed by clinicians and counselors frequently appear untrustworthy 
when judged by psychometric standards. Conversely, the clinician often 
finds the precisely designed and narrowly focused tests that come from the 
psychometric specialist unsatisfactory because they do not serve his practical 
needs. The cleavage between these two schools of thought has been reduced
during the past decade, on the one hand by the increased concern of
clinicians for the rigorous specification of hypotheses and validation, on the
other hand by the broadening of psychometric theory to make a place for
tests designed for purposes other than prediction of a specific criterion. This
book views tests from both the practical and technical perspectives, so that
the industrial, clinical, educational, or military psychologist will learn how
the psychometric specialist evaluates tests and the psychometric specialist
will understand the practical requirements which tests must meet.

The book is intended to serve the needs of undergraduates and beginning 
graduate students in psychology and counseling. It makes no attempt to 
exhaust any one of the fields of testing; rather, it covers those essentials on 
which later study of such specialties as industrial selection, clinical case in- 
terpretation, or test theory may be based. 

There has been substantial change from the first edition of the book, 
though the broad outline and aims remain the same and most of the basic 
principles stand unchanged. The past decade has seen notable advances in 
testing and test theory, including the Technical Recommendations for
psychological and achievement tests and the associated reformulation of
concepts of validity, the extensive validation of differential aptitude batteries,
and the decline of diagnostic pattern interpretation of the Wechsler
intelligence scale from a widely accepted practice to a discredited hypothesis.
One of the most striking changes in this period has been the improved 
quality of the information supplied by test publishers. For many tests, flimsy 
and inadequate manuals have been replaced by technical handbooks of 
monograph length, thereby increasing the importance of skill in interpreting 
information about reliability, validity, and norms. 

In my teaching, I place particular emphasis upon this skill, the principal 
assignments being concerned with reviewing of tests and selecting tests for 
particular programs (e.g., guidance of freshmen in a described liberal arts 
college). The presentation of specific tests in this book is designed to assist 
in this function and not to substitute for it. Tests selected for extended de- 
scription have wide application, illustrate important techniques and types of 
evidence, or illustrate significant principles. The space given to a test is by 
no means an indication of its merit; perhaps the prime determiner of inclu- 
sion has been the amount and variety of relevant information available, 
which biases the selection toward older tests. In order to introduce the 
student to a wider range of tests, a summary listing is given in many chap- 
ters. This summary is primarily a set of suggestions for further study. The 
annotation is too brief to serve as a critical review, and perhaps carries 
favorable or unfavorable connotations which I did not intend. I have ac- 
cepted this risk in order to provide a preliminary guide to the beginner lost 
in the morass of test titles. It is urged that the reader use the summaries only 
to prepare a list of tests to be studied further, bearing in mind that even 
this summary covers only a fraction of the tests on the market. A decision 
about the merit of a test must come after
and other sources.
n the United States
he main content of
tests with which his acquaintance is r
alizing on
through the questions on each section, the reader sees how the principles
apply and becomes aware of topics which require further thought. The 
questions do not always have specific answers. Frequently they are de- 
liberately controversial, or can be answered only by a qualified “Yes, but—.” 
The student who sees two sides to any of the questions can have considerable 
confidence that he is doing good thinking. 

In accomplishing my purposes, I have been greatly aided by my profes- 
sional associations of the past ten years. Particularly broadening were the 
intimate association over a five-year period with the Committees on Test 
Standards of APA and other associations, the opportunity to pursue research 
in test theory made possible by the Bureau of Educational Research of the 
University of Illinois and the Office of Naval Research, and the opportunity
given me by the Office of Naval Research and the National Institutes of
Health to become acquainted with test research and applications at home
and abroad. My colleagues in these ventures taught me much about tests.
Howard B. Lyman, Russell P. Kropp, and Andrew Baggaley gave suggestions
for the revision, and Jean W. Macfarlane's criticisms of this manuscript
led to many improvements. I wish also to give special thanks to the
representatives of various test publishing houses, all of whom have been
most coöperative in supplying information about their tests and in helping
me to clarify my ideas. As always, the students on whom these ideas have
been tried were a major source of motivation and insight. Mrs. Lester M.
Friend's services as typist of many drafts of the manuscript are
acknowledged with appreciation.

LEE J. CRONBACH


September, 1959 


PART ONE


BASIC CONCEPTS


Who Uses Tests?


THE testing movement stands as a prime example of social science in ac- 
tion, since it touches on vital questions in all phases of our life. What is char- 
acter, and what sorts of children have good character? What personality 
make-up promises that an adolescent will be a stable, effective adult? How 
can we tell which 6-year-olds are ready to begin learning to read? Is this 
young man a good prospect for training in watchmaking, or should he go 
into a different vocation—say steamfitting or patternmaking? Such are the 
problems toward which testing and research on individual differences are 
directed. In this book, we will survey the methods which have been and are
being developed to solve these problems.


TYPICAL TEST USERS 


One way to get a quick overview of the region we are to explore is to find 
out what testers do. By meeting a few of the people who work with tests we 
can get an impression of the variety of services tests perform and of the way 
they fit into a psychological career. The people to be described are imagi- 
nary, each one being a composite portrait of many psychologists such as can 
be found in every part of the country. 

Let’s begin by calling on Helen Kimball. At about eleven on a January 
morning, we find her at her desk in the central administration building of 
the school system of Riverton, population 17,000. Miss Kimball is dark,
attractive, 35ish. Her position bears the title School Psychologist. The office in
which we find her is unusually bright, with decorative pictures, drapes, and 
a table low enough to accommodate a child. On the table are spread several 
objects: blocks, a cutout puzzle, a folder of pictures. 

Miss Kimball apologizes for the disorder of the table as she greets us. “I
just finished testing a boy and haven't had time to clean up the materials.
Usually I keep just a toy or two on the table, to attract the interest of any
child sent down to see me. These test materials are from the Wechsler
intelligence scale and a picture test for studying personality called the
Thematic Apperception Test.” When we express interest in her case, and
inquire about the reason for testing the boy, she outlines his background as
follows.

Charles is a boy from a foreign home, middle-to-low economic status, who 
in the fourth grade suddenly is causing trouble after having been known as a 
friendly, successful pupil in other grades. His teacher reports that he has 
made almost no progress in school subjects since the start of the year, that he 
refuses her attempts to give him extra help, and that he has begun to disturb 
the class by hitting other boys, taking objects from the girls to annoy them, 
and similar misdemeanors. A check with the files showed that his previous 
teachers had made many favorable reports: “A fine worker. Does everything
a little better than most other boys.” “Learns new ideas quickly. Good at
number work.” But the objective tests given at the end of the third grade
showed that he was not superior. In fact, in reading comprehension Charles 
was two months behind the average pupil of his class, and in arithmetic, his 
best score, he just reached the average. Probably the teachers were misled 
by his cheerfulness and industry into overrating his past learning. 

“Now,” says Miss Kimball, "they asked me to try to determine the causes 
of his problem. Teachers in each school check most of the cases; for instance, 
they give intelligence tests and reading tests, and make studies of the chil- 
dren the school needs to know more about. Charles was sent to me be- 
cause the teacher felt his behavior presented an especially serious problem. 
The school did have a mental-test record, because Charles' class took the 
Kuhlmann-Anderson group intelligence test two months ago. Charles' IQ 
was only 65. But his teacher said Charles wouldn't work on the test. He did 
a few items, then stopped and looked out the window; when she urged him 
to go ahead, he worked slowly, and seemed not to be trying.

“So my first problem was to try to find out how bright Charles is, to learn
what to expect of him. The Wechsler or the Stanford-Binet is our usual
measure. Since we give these tests individually, most children coöperate well.
When I gave the Wechsler this morning, Charles did about as well as most
10-year-olds; I haven't computed his IQ yet, but from the impression I
formed as I gave the test, it will come out about 90 to 100, just a trifle below
average. The score might be affected by his schooling, as many of the
questions use language. The Performance section of the test, though, uses blocks,
picture puzzles, and other tasks not likely to be affected by schooling, and
he did about the same as on the Verbal section; apparently language
difficulties aren't his big problem. I was pleased that he coöperated, since he'd
had trouble before. He was eager to work, cheerful, and seemed pleased with
his accomplishment. But of course we started out slowly, and I made a great
effort to interest him in the ‘games.’

“I did two other things with Charles. Usually we don't test so much in one
day, but the school wants to make some decisions about Charles at midyear.
So we broke off the testing and chatted awhile; then I gave him a vision test.
I chose that because I noticed some squinting during the intelligence test,
and the teacher had noted a few complaints of headaches. My vision tests
aren't as precise as an oculist’s, but they showed a little deficiency in one eye.
Worse, though, is his coördination; the eyes don't work together, but instead
look at slightly different parts of the page when he is reading. This probably
can be corrected, but we'll need further visual tests to be sure. Poor visual
coördination would cause trouble in reading and lead to fatigue.

"Since the emotional problem seemed to be severe, judging from the re- 
ports of Charles’ social behavior, I used my picture test. The child makes up 
stories about each picture, and the stories often reveal his worries and wishes. 
Here's one picture, for example, showing a boy huddled up in a corner. 
Charles made up a story that the boy was playing with the others and they 
made him stop and go home. The other boys said he had a different way of 
playing that wasn't right. Several stories like that suggest that Charles is 
greatly worried about losing his friends, and about ‘being different.’ The test 
gives many other suggestions about Charles’ problems, but I need to study
the record before I form definite conclusions.

“Our next steps will be to check on the vision problem and to clarify the
emotional difficulties. I'll have several conferences with Charles, helping
him talk out his difficulties. Then we will see what can be done to help him
solve them. The fact that he has normal mental ability is encouraging, since
we know he can do well if his adjustment improves. It will help to know that
he is average rather than superior, as past teachers suggested. Perhaps he's
had to live up to too high a reputation. We may use further tests later; the
ones used so far have narrowed our field of investigation, so that my
conferences with Charles will be effective.”

This sample gives some idea of Miss Kimball's work. No two cases are just 
alike, nor are the same tests appropriate for every case. In contrast to this
“clinical” approach is the work of a personnel manager for a department
store. This is a store with about 350 employees, ranging from roustabouts to
buyers and office personnel. Edward Blake, the personnel manager, is a
heavy-set, graying man of 45, who seems interested in whatever we have to
say. But there is also a briskness, a sticking-to-a-schedule. “The routines of
the job? I don't do much testing myself; but I do interview everybody we 
hire. That helps the store, because every employee knows there's someone 
here in the office who has met him and to whom he can take his problems. 

“When an applicant comes in, he fills out a personal-history blank, and my 
assistant, Miss Field, gives him a set of tests. The tests aren't quite the same 
for everybody. We give all applicants a short multiple-choice intelligence 


6 ESSENTIALS OF PSYCHOLOGICAL TESTING 


test, since different jobs in the store call for employees of different caliber. 
Most applicants get a test of simple arithmetic—addition, percentages, dis- 
counts, and so on. For package wrappers and merchandise handlers, we use 
a simple test of motor ability in which they place wooden cubes in a box as 
rapidly as possible. It doesn't predict who'll be the best employee, but it 
saves us from some lemons. For a few departments, we have trade tests, 
tests of information about the job. Some men claim to be shoe salesmen 
when they don't know a last from a counter. These tests check on the
experience the applicant claims in his application blank.

“Whatever tests Miss Field gives are scored and recorded on the application
blank. Then, when there is a vacancy, we pull out the names of people
who have the qualifications that job requires. I call in one or more of these
people, interview them, and if I think they'll do, I hire them. The tests are
most useful to sort out the good from the poor prospects. Miss Field can
give the tests very easily, and it saves us a lot of time we'd spend
interviewing people who wouldn't be good workers. Of course, Miss Field does a nice
job, making sure each person knows we're interested in him, and sending
each one away with a feeling that he's had fair consideration.”

Mr. Blake, of course, is a little different from some other personnel
manager we might have talked to. But his work is fairly typical of that in
businesses having substantial turnover.

Unlike Miss Kimball and Mr. Blake, Max Samuels and Paul Sheridan are
using tests for research which will have only distant practical applications.
We find them in the Psychology Laboratory of Atherton University on a July
day, surrounded by piles of test booklets. Samuels gets up from a tape
recorder to which he has been listening and offers to show us around the
project.

problems. When we give an ordinary 
ave many difficulties that seem to have 
Sometimes a person becomes confused 


carry out several steps in an orderly way, but 
loses his sense of direction and slips back into r 


these habitual ways of react- 
in problem solving, affecting 


al level of success, do not give 
m solving. 


plore what the important vari- 
pend a couple of years refining 
our observation techniques before we are ready to carry out formal studies.
Both of us teach during the school year, but we spend about a quarter of 
our time giving tests to students. During the summer we analyze the rec- 
ords, revise the tests for the next tryout, and take a few more steps toward 
a theory of problem solving." 

The first test Samuels shows us uses the same blocks Miss Kimball had on 
her table. The person tested is shown a mosaic design and asked to make the 
same design out of blocks. “In the intelligence test,” Samuels says, “the score 
reports the number of designs completed within certain time limits. The
tester may note casual observations of the kinds of error the person makes,
but he does not score them. We are trying to obtain dependable scores
indicating how systematically the person attacks the problem, how often he
repeats a mistake, and how long a time passes before he notices a mistake.
Sheridan gives the test in a room with a large mirror set in the wall. The
mirror is fixed so that one can see through from the back; I sit on that side and
observe every detail of what the subject does. I dictate a record into the tape
recorder. We can listen to the tapes whenever we wish, and work out the
nature and time of each error. We have developed new designs which make
certain types of error more likely, and later we hope to develop a simpler
scoring method which will not require a tape recorder.”

Samuels shows several other tests using mazes, anagrams, and designs 
made by building up layers of cutout colored stencils. “Our main purpose,” 
he says, “is to identify consistent patterns which the person shows on many 
different problems. These patterns are the ones we expect him to carry over 
when he writes a theme in English or tries to identify an unknown substance 
in chemistry.”

We inquire about a piece of apparatus with a ring of lights and a few 
pushbuttons. “This,” he says, “is an experimental test which permits us to
present much longer and more complex tasks than the usual puzzle. It is used 
to measure abilities of high-level scientific and technical workers; one needs 
very difficult tasks to separate the best men in such a group. We are using 
it with average students because they make many errors, and our main con- 
cern is to study the types of error made by different persons. The apparatus 
is wired so that it follows some simple rules. These rules change with every 
problem. There are three pushbuttons which turn on and off various
combinations of lights. The person’s task may be to turn on light number 3 only.
He presses the buttons in turn to find out what lights each button controls.
For instance, when he presses button 1, lights 3, 4, and 5 go on. When he has
all the information, he must find a sequence of actions which will leave only 
light 3 lit. A problem of this type can be made very complicated; even a 
bright person takes thirty minutes on some of our problems. One interesting 
because any one trial exposes only a portion of the behavior that interests us.

The size of these sampling errors is described by the standard error of
measurement ($s_e$), or by the “error variance” ($s_e^2$). The obtained score is a
combination of the true score and the error on a particular trial. The
variance of obtained scores is the total of the error variance and the variance
of true scores.

These variances have a direct relation to the correlation between scores
from two samples of behavior. If we let $r_{tt}$ stand for the reliability coefficient
(correlation between scores),

$$r_{tt} = \frac{s_{\text{true}}^2}{s_t^2} = \frac{s_t^2 - s_e^2}{s_t^2} = \frac{\text{True variance}}{\text{Total variance}}$$

From the data for the TMC, we find

Total   $s_t = 10.4$   $s_t^2 = 108.2$
Error   $s_e = 3.7$    $s_e^2 = 13.7$
True    $s_{\text{true}}^2 = 94.5$ (by subtraction)

$$r_{tt} = \frac{94.5}{108.2} = .87$$

The reliability coefficient tells what proportion of the test variance is due
to “true” individual differences, and not to sampling error. In this case,
87 percent of the variance is “true” and therefore 13 percent is “error.”
What we mean by “error” is defined in part by the experimental procedure,
as we shall see later.
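The arithmetic of this example can be retraced in a few lines of Python. This is only an illustrative sketch: the function name and the dictionary keys are assumptions introduced here, and the inputs are the TMC figures quoted above (a total standard deviation of 10.4 and a standard error of measurement of 3.7).

```python
def reliability_from_sds(total_sd: float, error_sd: float) -> dict:
    """Split total score variance into true and error parts.

    Hypothetical helper: assumes total variance = true variance + error
    variance, as stated in the text.
    """
    total_var = total_sd ** 2
    error_var = error_sd ** 2
    true_var = total_var - error_var          # "by subtraction"
    return {
        "total_var": total_var,
        "error_var": error_var,
        "true_var": true_var,
        "r_tt": true_var / total_var,         # proportion of variance that is "true"
    }


tmc = reliability_from_sds(total_sd=10.4, error_sd=3.7)
print(round(tmc["r_tt"], 2))        # 0.87
print(round(1 - tmc["r_tt"], 2))    # 0.13
```

Because the variances are recomputed from the rounded standard deviations, the intermediate values differ from the printed ones in the second decimal place, but the coefficient still rounds to .87.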

Reliability and Test Length. The importance of lengthening tests is
obvious. With every question added, the sample of performance becomes a more
adequate index of performance on all possible questions. A single addition
problem is a very poor sample of a person's ability, since we are quite likely
to present a number combination that is particularly hard or easy for him. By
using more and more questions of the same general sort, we come closer to a
good estimate of his general ability on addition problems.

Longer tests are also less influenced by other chance factors. If a test has
only five multiple-choice items, a few people might get all the items right
just by guessing. In a fifty-item test, practically no one could do it by
guessing. Variations due to guessing tend to cancel out. Three fifteen-minute
observations of a child's social behavior provide a poor sample of his typical
behavior; thirty observations, however, should give a dependable picture.
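A short calculation shows how quickly guessing loses its influence as a test grows. The sketch below rests on assumptions the text does not state: each item has four options, every answer is a blind guess, and items are independent. The function names are hypothetical.

```python
from math import sqrt


def p_all_correct_by_guessing(n_items: int, n_options: int = 4) -> float:
    """Chance that a pure guesser answers every item correctly."""
    return (1.0 / n_options) ** n_items


def sd_of_guessed_proportion(n_items: int, n_options: int = 4) -> float:
    """Standard deviation of the proportion correct obtained by guessing."""
    p = 1.0 / n_options
    return sqrt(p * (1.0 - p) / n_items)


for n in (5, 50):
    print(n, p_all_correct_by_guessing(n), round(sd_of_guessed_proportion(n), 3))
```

With five items, about one guesser in a thousand gets a perfect score, so in a large group a few may. With fifty items the chance is negligible, and the spread of guessed scores around the chance level is also much narrower, which is the sense in which variations due to guessing cancel out.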

The Spearman-Brown formula (see Computing Guide 6) permits us to
estimate what reliability the test would have if it were lengthened or
shortened. The formula assumes that when we change the length of the test we
do not change its nature. Extreme increases in test length, however,
produce boredom and may reduce reliability. Furthermore, unless one is
careful, added items or added periods of observation may not cover the same
behavior or ability as the original test.
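For readers who prefer to compute rather than substitute by hand, the Spearman-Brown estimate in its standard form, n·r / (1 + (n − 1)·r), can be written as a small Python function. The function name and the input checks are assumptions made for this sketch; like the formula itself, it assumes that the added material is of the same nature as the original test.

```python
def spearman_brown(r: float, n: float) -> float:
    """Estimated reliability of a test n times as long as the original.

    r is the reliability of the original test; n may be a fraction
    (for example 0.5) when the test is shortened.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    denominator = 1.0 + (n - 1.0) * r
    if denominator <= 0:
        raise ValueError("formula is undefined for this r and n")
    return n * r / denominator


print(round(spearman_brown(0.40, 2), 2))   # 0.57
print(round(spearman_brown(0.40, 0.5), 2)) # 0.25
```

Doubling a test with reliability .40 raises the estimate to about .57, and halving it lowers the estimate to .25. With n = 1 the function returns the original reliability unchanged.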

One must examine the reliability of every score he intends to interpret.
Some testers, knowing that a test as a whole is reliable, place faith in its part
scores also. Since short tests are likely to be unreliable, a part score based on
a few items is of limited value. The reliability of part scores as well as of
total scores should be given in the test manual. If this reliability is low or
unknown, the part scores cannot be relied upon.

While inaccuracy lowers validity, this does not necessarily argue for
making predictor tests very long. An increase in test length has a great
effect on reliability but a much smaller effect on validity. The following
equation applies, where $t_n$ represents a test $n$ times as long as test $t$
(Gulliksen, 1950, pp. 88 ff.):

$$r_{t_n c} = r_{tc} \sqrt{\frac{r_{t_n t_n}}{r_{tt}}}$$

The observed test-criterion correlation is $r_{tc}$. Under the square root sign,
$r_{tt}$ is the observed reliability for test $t$ and $r_{t_n t_n}$ is the reliability of the longer
test, calculated by the Spearman-Brown formula. Figure 22, derived from
the formula above, shows the effect of lengthening or shortening the TMC,
using $r_{tc}$ as .40 and $r_{tt}$ as .87. As we lengthen the test, its reliability
approaches 1.00 according to the Spearman-Brown formula (broken line). The
increase in validity is expected to follow the solid line in the figure. As the
test is made longer and longer, validity approaches a limit.
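The relation between length, reliability, and validity can be tabulated with a brief Python sketch. The function names are hypothetical, and the limit shown, the observed validity divided by the square root of the observed reliability, is what the formula yields when the reliability of the lengthened test is set to 1.00. The inputs are the TMC values used in the text (validity .40, reliability .87).

```python
from math import sqrt


def spearman_brown(r_tt: float, n: float) -> float:
    """Reliability of a test n times as long (standard Spearman-Brown form)."""
    return n * r_tt / (1.0 + (n - 1.0) * r_tt)


def validity_at_length(r_tc: float, r_tt: float, n: float) -> float:
    """Expected test-criterion correlation for a test n times as long."""
    r_nn = spearman_brown(r_tt, n)
    return r_tc * sqrt(r_nn / r_tt)


def validity_limit(r_tc: float, r_tt: float) -> float:
    """Value approached as n grows and the reliability nears 1.00."""
    return r_tc / sqrt(r_tt)


for n in (0.5, 1, 2, 4, 8):
    print(n,
          round(spearman_brown(0.87, n), 3),
          round(validity_at_length(0.40, 0.87, n), 3))
print("limit:", round(validity_limit(0.40, 0.87), 2))   # limit: 0.43
```

Under these figures even an eightfold increase in length cannot carry validity past about .43, while reliability keeps climbing toward 1.00. This is why lengthening a test that is already fairly reliable buys little predictive power.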


1. Suppose that a test has a known reliability. The Spearman-Brown
formula, given below, estimates the reliability of the score from a
similar test $n$ times as long.

$$r_n = \frac{n r}{1 + (n - 1) r}$$

where $r$ is the original reliability; $r_n$ is the reliability of the test $n$ times as long.

2. To predict the reliability of a test twice as long as the original test,
substitute in the formula $n = 2$. If $r = .40$,

$$r_2 = \frac{2(.40)}{1 + (1)(.40)} = \frac{.80}{1.40} = .57$$

3. Suppose the original test is shortened to half its original length. The
reliability is estimated using $n = \frac{1}{2}$.

$$r_{1/2} = \frac{\frac{1}{2}(.40)}{1 - \frac{1}{2}(.40)} = \frac{.20}{.80} = .25$$

COMPUTING GUIDE 6. USE OF THE SPEARMAN-BROWN FORMULA




feature of this apparatus is its automatic recording. Every time the person 
presses a button, a record is punched on a teletype tape. This tape can later 
be decoded to show just what the person did and when.” 
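The kind of problem this apparatus poses can be modeled as a search over light patterns. The sketch below is not a description of the actual device: the wiring table is invented for illustration, each button is assumed to toggle a fixed set of lights, and the breadth-first search is simply one systematic way to find a shortest sequence of presses.

```python
from collections import deque

# Hypothetical wiring: each button toggles the listed lights.
WIRING = {
    1: frozenset({3, 4, 5}),
    2: frozenset({4, 6}),
    3: frozenset({5, 6}),
}


def solve_lights(buttons, target, start=frozenset()):
    """Return a shortest list of button presses that turns the start
    pattern into the target pattern, or None if no sequence works."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        lit, presses = queue.popleft()
        if lit == target:
            return presses
        for name in sorted(buttons):
            new_lit = lit ^ buttons[name]      # toggle that button's lights
            if new_lit not in seen:
                seen.add(new_lit)
                queue.append((new_lit, presses + [name]))
    return None


print(solve_lights(WIRING, frozenset({3})))   # [1, 2, 3]
print(solve_lights(WIRING, frozenset({4})))   # None
```

With this invented wiring, leaving only light 3 lit takes three presses, and some targets cannot be reached at all. A person working the real apparatus must discover the wiring by trial before any such plan is possible, which is where the errors of interest arise.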


Sheridan comes in at this moment with an armload of boxes that turn out 
to contain cards for use in computing machines. His role in the project, he 
explains, is to analyze the data after all the tests and records have been 
scored. "The electronic computer has been a blessing to research like this. 
We obtain about 200 scores on every student we test, and it would take for- 
ever to work out the relations on an ordinary calculator. But the electronic 
machine gives us the answers in just a few hours. The catch is that sometimes
it takes a month to put all the records onto these cards. Every observation
has to be reduced to numerical form before it can be treated statistically.

“Our main statistical method is factor analysis. This helps us to separate
the variables which affect only a single task from the ones which show up
consistently throughout the person's performance. We also find out which
test scores give the best measures of each variable. The results so far suggest
that we will eventually have dependable measures of how persistent, how
systematic, and how adaptable the person is.

“We are not primarily interested in practical applications of the test. If we
can classify people according to the way they solve problems, then we want
to study how they get that way. Probably anxiety is an important cause of
many of these errors, but we want to learn why one anxious person’s errors
habitually differ from those of another. We will eventually do experiments
in which we frustrate people in various ways to see if different kinds of
emotional stress produce different sorts of errors.
search we have to be able to distinguish a:

It is easy to think of applications for such tests as Sheridan and Samuels
are developing. The tests might be useful in diagnosing mental patients, in
selecting students for specialized training, or in analyzing students whose
vel. Very often, tests that are de-


the clinical psychologist in a hospital, the
tester preparing standardized tests for school use, the vocational counselor,
and many others. In addition to these highly qualified investigators, we
might pay more attention to the Miss Fields who gi
psychological research. There are a great variety of tests, covering many
sorts of characteristics. Even for a single characteristic such as mental ability, 
there are many tests which have different uses. The significance of test 
scores is greatest when they are combined with a full study of the person by
means of interview, case-history records, application blanks, and other meth- 
ods. Tests provide facts which help us understand people; they almost never 
are a mechanical tool which can render decisions automatically. 


PURCHASING TESTS 
Who May Obtain Tests? 


Tests are useful to many professions, but in the hands of persons with in- 
adequate training they do a great deal of harm. An untrained user may ad- 
minister a test incorrectly. He may place undue reliance on inaccurate 
measurements. He may misunderstand what the test measures and reach un- 
sound conclusions. It is therefore important for the user to confine himself to 
tests that he can handle properly. 

To see the implications of this remark, consider industrial personnel testing 
as an example. To a manager it may appear simple to give a group
intelligence test, score it with a punched-out key, tabulate the scores, and hire
the best man. A personnel psychologist, however, knows that on some
routine jobs average men make better employees than highly intelligent men,
who become bored and quit. He knows that a general mental test does not 
measure the abilities most important in many factory jobs. He knows that 
even experts make errors when they try to guess which tests will predict 
success in a given job; a scientifically designed tryout is essential to make
sure that the tests actually pick better employees. 

Introducing and operating an industrial testing program requires many 
different abilities: 


1. Analyzing the job to identify abilities which could be relevant.
2. Selecting promising tests for tryout.
3. Constructing new tests when no published test is suitable.
4. Planning and carrying out an experimental trial; choosing the final
   set of tests.
5. Deciding how test results are to be used in selection.
6. Routinely administering tests to applicants.
7. Scoring.
8. Interpreting the test and making hiring decisions within the general
   plan.

A great deal of training is required to perform steps 1 through 5. For most
tests used in industry, steps 6 and 7 can be performed by an intelligent
clerical worker under proper supervision. Step 8 may be a routine operation or
may call for a decision by an executive who considers a psychologist’s
recommendation along with other facts.

Industrial personnel workers in the United States are qualified at various 
levels: 

9 Diploma in industrial psychology. A diploma is given by the American 
Board of Examiners in Professional Psychology to an industrial psychologist 
who possesses (among other qualifications) the training and experience ad 
quired for carrying out all phases of a testing program.” A person holding 
this diploma is called a diplomate. 

€ Ph.D. in personnel psychology. A psychologist at this level (who may 
have received his training in a university department of psychology, educa- 
tion, or business management) should be able to perform all the functions 
listed above. If he has limited experience, he may need to consult a better- 
qualified person, especially in planning the program. Numerous consulting 
firms provide assistance of this type. 

9 Limited specialized training. Workers who have training in personnel 
methods equivalent to a master's degree can carry out specialized functions 
within a general plan. They can administer complicated tests, collect data 
on the performance of employees, and make some decisions about indi- 
viduals. A psychologist can train an intelligent assistant to perform such 
functions, although he must then provide close supervision.? 

9 Intelligent workers without psychological training. A person without 
psychological training can learn to administer many group tests, take charge 
of the scoring of objective tests, and apply mechanical rules for selection on 
the basis of scores. 

• Ordinary clerical workers. Workers at this level should be used only
for routine scoring under competent supervision, and for assisting in test 
administration. 

If we were to consider some other use of tests such as a vocational counsel- 
ing service, a school testing program, or a diagnostic service in a mental hos- 
pital, we would observe similar needs. In each of these services there is need 
for some routine handling of tests and test data, for responsible supervision, 
and for high-level planning of the total program. A testing program involves 
far more than buying a package of tests and going to work. 

The amount of specialized training required depends upon the tests to be 
used. Some tests can be administered and interpreted by responsible per- 
sons who have no specialized training. Other tests serving the same general 
purpose can be used only by well-qualified psychologists. For example, two
tests which might have some value in selecting men for training as junior
executives are the Ohio State Psychological Examination and the Thematic
Apperception Test (TAT). The former is a precise and fairly difficult test of
vocabulary knowledge and verbal reasoning ability. The directions and
scoring procedure are so simple that a careful high-school graduate can follow
them. An employer with no psychological training can easily understand
what the results mean. To administer and interpret the TAT, a person must
have graduate training in the psychology of personality and should have
additional supervised experience with this particular test. It is used to
investigate the drives and creative abilities of the applicant, and the
conclusions it suggests are not highly dependable. Serious errors in judgment
would result if the test were interpreted by anyone save a cautious and able
psychologist.

The APA Code for Test Distribution. Distributors of tests try to restrict sales 
to qualified persons, just as the sale of medicines is restricted. Test distribu- 
tors check the qualifications of purchasers to determine whether they are 
able to use whatever tests they order. Severe restrictions are placed on the 
tests which are most difficult to interpret and the misinterpretation of which
would be most serious.

A further reason for restriction is to prevent copies of questions from fall- 
ing into the hands of persons who will later take the test. Students would 
like to become familiar with a college entrance examination in advance, but 
this knowledge would give them an unfair advantage over other applicants. 
Parents sometimes try to help their child by coaching him on intelligence 
test items, but to the extent that their coaching succeeds, it prevents the
psychologist from making sound decisions. The control system protects all
legitimate users of published tests.


The guiding principles of the control system are set down in the Ethical
Standards of Psychologists. This important statement was officially adopted
by the American Psychological Association in 1950. The following paragraphs
abstract and paraphrase the formal statement, omitting legalistic details
applying only to borderline problems (Ethical Standards, 1958, pp.
146-148):


Tests and diagnostic aids should be released only to persons who can
demonstrate that they have the knowledge and skill necessary for their
effective use and interpretation. Tests can be classified in the following
categories:

Level A. Tests or aids which can be adequately administered, scored
and interpreted with the aid of the manual and a general orientation to
the kind of organization in which one is working. (Examples: educational
achievement, trade, and vocational proficiency tests.) Such tests
and aids may be given and interpreted by responsible nonpsychologists
such as school principals and business executives.

Level B. Tests or aids which require some technical knowledge of test 
construction and use, and of supporting subjects such as statistics, indi- 
vidual differences, the psychology of adjustment, personnel psychology, 
and guidance. (Examples: general intelligence and special aptitude 
tests, interest inventories and personality screening inventories.) 

These tests and aids can be used by persons who have had suitable 
psychological training; or are employed and authorized to use them in 
their employment by an established school, government agency, or busi- 
ness enterprise; or use them in connection with a course for the study of 
such instruments. 

Level C. Tests and aids which require substantial understanding of
testing and supporting psychological topics, together with supervised
experience in the use of these devices. (Examples: clinical tests of
intelligence, and personality tests.)

Such tests and aids should be used only by Diplomates of the Ameri- 
can Board of Examiners in Professional Psychology; or persons with at 
least a master’s degree in psychology and at least one year of properly 
supervised experience; or other psychologists who are using tests for re- 
search or self-training purposes with suitable precautions; or graduate 
students enrolled in courses requiring the use of such devices under the 
supervision of a qualified psychologist; or members of kindred profes- 
sions with adequate training in clinical psychological testing; or grad- 
uate students and other professional persons who have had training and 
supervised experience in administering and scoring the test in question, 
and who are working with a person who is qualified to interpret the test 
results. 

Being a trained psychologist does not automatically make one a quali- 
fied user of all types of psychological tests. Being qualified as a user of 
tests in a specialty such as personnel selection, remedial reading, voca- 
tional and educational counseling, or psychodiagnosis does not necessarily
qualify one in other specialties. Being a psychiatrist, social worker,
teacher, or school administrator does not ipso facto qualify one to use
projective techniques, intelligence tests, standardized achievement tests,
etc.


The system for controlling distribution varies somewhat with the publisher.
The major distributing firms check the name of each purchaser against the
directory of diplomates and similar sources to determine whether his
qualifications are sufficient for the tests he has ordered. If there is doubt,
the purchaser is asked to give information about his training. The distributor
may ask some qualified psychologist who knows the purchaser (e.g., one of his
former professors, or his clinical supervisor) to endorse his request. The
publisher evaluates this information and authorizes the person to purchase
tests up to a certain level. Because such investigations are costly, some of the
smaller publishers have made no effective effort to control sales of their tests.
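
The publisher's rule of authorizing a purchaser "up to a certain level" can be pictured as a simple ordered comparison. The sketch below is only an illustration under stated assumptions, not any distributor's actual procedure: the names `Level`, `TEST_LEVELS`, and `may_purchase` are hypothetical, and the sample entries are taken from the examples the APA code gives for Levels A, B, and C.

```python
from enum import IntEnum


class Level(IntEnum):
    """APA distribution levels, ordered from least to most restricted."""
    A = 1
    B = 2
    C = 3


# Illustrative entries only, drawn from the examples quoted in the APA code.
TEST_LEVELS = {
    "vocational proficiency test": Level.A,
    "interest inventory": Level.B,
    "special aptitude test": Level.B,
    "clinical test of intelligence": Level.C,
}


def may_purchase(authorized_up_to: Level, test_name: str) -> bool:
    """Return True if the test's level does not exceed the purchaser's authorization."""
    return TEST_LEVELS[test_name] <= authorized_up_to


if __name__ == "__main__":
    # A purchaser authorized up to Level B is checked against each sample test.
    for name in TEST_LEVELS:
        print(name, may_purchase(Level.B, name))
```

A lookup of this kind covers only the mechanical part of the check; deciding which level a purchaser's training justifies still requires the inquiry described above.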

The ethical responsibility for restricting tests rests on the purchaser as
much as on the distributor. A person who uses a test for which his training is
insufficient runs the risk of making serious errors. It is essential that every
tester evaluate his own qualifications (discussing them with a better-trained
person if he is in doubt) and decide what tests he is ready to use. Ideally,
professional workers would restrict their own testing by self-control, so that
the publisher would have to concern himself only with nonprofessionals such
as employers who believe that anyone can apply personality tests, parents
who want to test their children's intelligence, or job applicants who want to
practice for tests they may be asked to take.

1. Sometimes a tester relies on the distributor's judgment, thinking like this:
"I'm not sure whether I'm qualified to use this test. I'll order it, describing
my training honestly; then if the publisher sells it to me, I will know that
I'm qualified." What is wrong with this attitude?

* An employer without psychological training decides to buy personality tests
and use them on applicants. What is gained by refusing to sell him the tests,
in view of the fact that without them he will base his judgments entirely on
superficial impressions gained through an interview?

* Examine two or three publishers' catalogs to see what statements are made
about restriction of sale. Are the restrictions uniform? Do they follow the APA
code exactly?
* Classify the following tests according to the levels of the APA code:

a. A mechanical aptitude test requires the person to assemble simple objects
(e.g., a mousetrap) as fast as possible.

b. The Strong Vocational Interest Blank is an objectively scored questionnaire.

c. A test of arithmetic computation is intended for screening store clerks,
cashiers, and similar employees.

d. A diagnostic oral reading test calls for careful observation of the pupil's
errors, self-confidence, method of attacking unfamiliar words, etc.

* What is meant in the code by the phrase "with suitable precautions"?

* The code does not authorize distribution of tests to people who wish to assess
their own aptitudes, skills, or personality characteristics. What are the reasons
for this policy?

* Most American tests are distributed through publishers to anyone who is
qualified and wishes to buy them. Another system is found in various national
employment services and youth agencies, especially in Europe. Each counseling
service devises a special set of aptitude tests for its own use. Only the
counselors employed by this agency are allowed to use the tests. What are the
advantages and disadvantages of this type of control, compared with the usual
type of distribution?

Sources of Information About Tests


A first step in looking for tests is to consult the catalogs of major test
publishers. Except for a few tests obtainable only from smaller firms, the
important tests are distributed in the United States by five companies:
California Test Bureau, Educational Testing Service, Psychological Corporation,
Science Research Associates, and World Book Company. The person needing
to purchase tests should therefore obtain the current catalogs of these
firms, and of other publishers likely to have tests in his field of interest.⁴


(a) Mechanical Comprehension Tests

George K. Bennett, et al.

Designed to measure ability to understand mechanical relationships, these
tests consist of drawings with simply phrased questions about them. The
effects of special environment and of rote memory of physical laws are
minimized. Useful in selecting personnel for mechanical work and for selection
of students for technical and engineering training.

Form AA has norms for a large variety of school and industrial groups.
Appropriate for general population testing; for more highly selected groups,
use Forms BB or CC.

Form AA-F is identical with Form AA except that instructions and questions
are in both English and French, for use with French-Canadians.

Form AA-S is the Spanish language edition of Form AA; preliminary norms from
Cuba.

Form BB is more difficult than Form AA. Norms based on ten groups of students,
applicants, and employed technicians and engineers.

Form BB-S is identical with Form BB except that instructions and questions are
in Spanish; preliminary norms from Venezuela.

Form CC (Owens-Bennett) is more difficult than Form BB and yields a wider
range of scores at high ability levels. Norms are based on engineering
students.

Form W1 (Bennett-Fry) is the women's form of this series. Norms are based on
high school freshmen and senior girls and several occupational groups of
women. Difficulty level is between AA and BB.

High school and above. Time: no limit, about 30 min. Arranged with the test in
reusable booklets and with separate IBM answer sheets, which may be scored
either by hand or by machine.

Order booklets and answer sheets separately, specifying form and quantity of
each.

Booklets, sold in packages of 25 with manual and scoring stencils.

    1-9 packages           $4.50 each
    10 or more packages     4.00 each
    Single copies          25 cents each

Answer Sheets. Specify Form AA, BB, CC, W1, AA-S or BB-S (AA-F uses regular
answer sheet.) Sold only in packages of 50, $1.90 each, and packages of 500,
$16.00 each.

Specimen Set, 50 cents. Specify form desired.

Spanish forms AA-S and BB-S together in one specimen set, $1.00.


The catalog lists and describes tests. Most of the catalogs indicate clearly 
what level of training is required to use each test, and who may purchase it. 
The publisher’s recommendation should be viewed conservatively; in some 
instances the publisher indicates that a test can be used by a purchaser 
with limited training, even though testing authorities would favor a stricter 
standard. 

Just what information the catalog itself can provide is illustrated by the
excerpt above describing the Bennett tests, which we shall discuss fully in
Chapter 3 and subsequently.* (The first symbol in the excerpt is a code letter,
"a," indicating that this falls in the least restricted category of tests. Any
recognized business or industrial firm may purchase this test for use in
personnel selection, even if there is no qualified psychologist on its staff.)

⁴ Addresses of publishers are given in the Appendix.


Tests may be suggested by several additional sources, particularly the
Mental Measurements Yearbooks (see p. 101).

Before a decision to purchase the test for use is made, a detailed study of 
its manual is needed. Whereas the catalog description is only a paragraph 
long, the manual offers several pages of information on the purposes to 
which the test is best suited, methods of administering and interpreting it, 
and its limitations. Sometimes the part of this information which describes 
the research basis of the test is placed in a technical handbook, leaving the 
less technical description for the examiner's manual. If the manual is divided 
in this way, both parts should be consulted. A “specimen set” of a test is a 
package including a manual, test booklet, and scoring key. Most universities 
and many school systems and counseling centers maintain collections of 
specimen sets for the use of students and professional staff. In addition,
specimen sets may be purchased directly from the publisher.


Suggested Readings 


Benton, Arthur L. Cerebral disease in a child. In Arthur Burton & Robert E. Harris,
Clinical studies of personality. New York: Harper, 1955. Pp. 600-611.

A difficult problem in psychodiagnostics is simply presented. A 9-year-old
was referred because of emotional and school problems. Test performance on
the Stanford-Binet scale and on special drawing tests showed great
variability in mental functioning. Interpretation of the performance led to a
diagnosis of brain disease, subsequently confirmed by an operation.

Crutchfield, Richard S. Conformity and character. Amer. Psychologist, 1955, 10,
191-198. (Reprinted in Don E. Dulany, Jr. & others, Contributions to modern
psychology. New York: Oxford University Press, 1958. Pp. 293-307.)

In an illustration of the use of test procedures to advance scientific knowledge,
an experimental test of readiness to conform to the opinion of one's group is
described. Results show the relation of this tendency to personality and to
the nature of the group.

Lawson, Douglas E. Need for safeguarding the field of intelligence testing.
J. educ. Psychol., 1944, 35, 240-247.

Errors are made when teachers with inadequate training administer or
interpret tests of mental ability.

Ogg, Elizabeth. Psychologists in action. Public Affairs Pamphlet No. 229, 1955.

This description of the roles psychologists play is written for laymen and for
those considering careers in the field.

Super, Donald E. A case study in exploration: curricular and occupational. The
psychology of careers. New York: Harper, 1957. Pp. 92-100.

A typical problem in counseling an adolescent girl who is uncertain about
possible careers is described. Information from aptitude, interest, and
achievement tests is combined with the girl's own statements and her school
record to help her to a greater degree of self-understanding.

Swanson, Wendell M., & Lindgren, Eugene. The use of psychological tests in
industry. Personnel Psychol., 1952, 5, 19-93.

A questionnaire survey of firms in Minneapolis and St. Paul gives a realistic
summary of the testing programs in use for selecting employees.

* From the 1959-1960 catalog of the Psychological Corporation.


Purposes and Types of Tests 


DECISIONS FOR WHICH TESTS ARE USED 
ANYONE who works with people is continually making decisions. A personnel
manager decides whom to hire; a teacher decides whether each pupil
is ready for long division; a physician decides how a patient should be
treated. If the decision maker obtains better information before making his
decision he will have a better chance of attaining the results he desires.

All decisions involve prediction. Any test tells about some difference
among people's performances at this moment. That fact would not be worth
knowing if one could not then predict that these people will differ in some
other performance or in the same performance at some other time.
Consider a test of visual recognition. We flash a row of letters on the screen
for an instant, and the person reports what he has seen. Some people recognize
four letters; others grasp seven in the same brief interval. This difference
is intriguing, but it is unimportant until it can be related to some other
behavior. The applied psychologist sees that this task possibly has something
in common with airplane recognition and with perception in reading. He
investigates whether the flash-recognition test will predict success in these
practical activities. If so, it can assist the armed forces to select lookouts, or
help the primary grade teacher to plan reading instruction.

Prediction is involved in clinical use of tests also. A clinician might use
a flash technique to see whether a person has especial difficulty in perceiving
emotionally toned words like guilt and failure, that being a possible
indication of emotional disturbance. Such a test is useful only if the unusual
score foreshadows deviant behavior at some time in the future. The clinician
does not need to detect emotional maladjustment if that were only an inner
condition which could never crop out. The significance of the clinical test
hinges on the fact that certain responses permit one to predict behavior
which should be forestalled or encouraged.

The scientific investigator may not care whether the tests he uses have
value for practical decisions. He may not even be interested in individual
differences. But he too must have tests which predict. The flash test is a 
good laboratory measuring instrument because its scores are stable. If condi- 
tions are not altered, a person makes about the same score each time he is 
tested; thus today's test predicts tomorrow’s score. If the score changes when 
the experimenter changes the illumination, we know that the change re- 
sulted from the illumination and not from chance variation. The experi- 
menter therefore can study systematically how flash perception is related to 
illumination. When this relation is fully understood, he has a general law 
which predicts what changes in perception will accompany changes in il- 
lumination. If the test were not able to predict tomorrow's performance from 
today's (other things being equal), it would be of no use to the experimental 
psychologist. 
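
One common way to express how well today's score predicts tomorrow's is the correlation between two administrations of the same test. The following is a minimal sketch under assumed data: the six pairs of scores are invented for illustration (number of letters recognized on a flash test by six persons), and `pearson_r` is a hypothetical helper written out here, not a procedure given in the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores: letters recognized by the same six persons on two days.
today = [4, 5, 5, 6, 7, 7]
tomorrow = [4, 5, 6, 6, 7, 6]

if __name__ == "__main__":
    print(round(pearson_r(today, tomorrow), 2))
```

A coefficient near 1 would mean that persons keep nearly the same standing from one day to the next, which is the sense in which a stable test "predicts"; a coefficient near zero would mean that today's score tells little about tomorrow's.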
1. Demonstrate that prediction is intended in each of the following situations: 

a. A foreman is asked to rate his workers on quality of work. 

b. Airlines require a periodic physical examination of pilots. 

c. A psychologist investigates whether students are more "liberal" in their
attitudes toward birth control after two years of college study.

d. A teacher gives James a grade of C in algebra and Harry a grade of A.


2. Tests are used to obtain information which will permit sounder decisions. Does 
this statement apply to the Gallup public opinion poll? 


Selection 


Tests aid in making many sorts of decisions, including selection and classi- 
fication of individuals, evaluation of educational or treatment procedures, 
and acceptance or rejection of scientific hypotheses. We shall consider briefly 
each of these types of decision, beginning with selection. 

In a selection decision, an institution decides to accept some men and to
reject others. Hiring an employee is such a selection decision. The distin- 
guishing feature of the selection decision is that some men are rejected, and 
their future performance is of no concern to the institution. A person may be 
“selected” and “classified” at the same time. 


Classification 


In classification, we decide which of many possible assignments or treat- 
ments a person shall receive. Examples: The college student asks a counselor 
to help him choose the best curriculum. The Navy tests each recruit to de- 
termine whether he should be assigned to the engine room, the chartroom, or 
the gun turret. The schoolboy who reads poorly is given a series of tests to
determine what method of remedial instruction he needs, and whether he
should first have some other treatment (eyeglasses, psychotherapy, etc.).

One important classification problem is diagnosis of mental patients. This
may seem like an attempt merely to find the right name for a patient's
disorder, but it really is a choice among treatments, since the patient's label
determines what treatment he gets.

Where people are assigned to different levels of work (rather than to dis- 
tinctly different types of work) we have a placement decision. Placement is 
a special case of classification. "Placement tests" are used to allocate college 
freshmen to the proper section of English, i.e., to the appropriate treatment. 
Choosing officer candidates from among enlisted men is a placement decision
rather than a selection decision, since the men not chosen as officers
remain in the army and are used in a different way.

A sharp distinction between classification decisions and selection decisions
is required because a test which is useful in making one type of decision
may not help with the other (Cronbach and Gleser, 1957). A test which
detects serious emotional disturbances would be very useful in keeping
unstable men out of the Army (selection). The test might not help at all, on the
other hand, in deciding how to treat men who break down in the service
(classification). As we shall see in Chapter 12, one interprets validity data
quite differently for classification and selection purposes.
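
The difference between the two kinds of decision can be put in schematic form. What follows is a rough sketch under stated assumptions, not a method from the text: the function names `select` and `classify`, the persons, the cutoff, and every number are invented for illustration, and the three assignments echo the Navy example given earlier.

```python
def select(scores, cutoff):
    """Selection: accept persons at or above the cutoff and reject the rest."""
    return [name for name, score in scores.items() if score >= cutoff]


def classify(predicted_success):
    """Classification: every person receives an assignment, namely the one
    for which his predicted success is highest."""
    return {
        person: max(by_assignment, key=by_assignment.get)
        for person, by_assignment in predicted_success.items()
    }


# Invented data: one score per applicant for the selection decision.
applicant_scores = {"Adams": 62, "Baker": 48, "Clark": 71}

# Invented data: a separate prediction per assignment for the classification decision.
recruit_predictions = {
    "Adams": {"engine room": 0.7, "chartroom": 0.4, "gun turret": 0.5},
    "Baker": {"engine room": 0.3, "chartroom": 0.6, "gun turret": 0.5},
}

if __name__ == "__main__":
    print(select(applicant_scores, cutoff=60))
    print(classify(recruit_predictions))
```

In this sketch `select` needs only a single score that separates good prospects from poor ones, whereas `classify` needs predictions that differ from one assignment to another. A test that supplies the first kind of information need not supply the second.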

Testing often leads to a description of the person, which can be far more
individualized than a simpler classification. For instance, a test battery plus
other facts might classify a student as a promising engineer, and this would
lead him to a decision to enroll in engineering. A description would report in
addition the many particular assets and liabilities that distinguish this
student from other prospective engineers. He is especially interested in
aviation; he has a rather immature and uncoöperative attitude toward superiors;
he works energetically in short bursts, with no long-range scheduling. All
these facts are useful to the counselor. Each one bears on a different decision
about course planning, about disciplinary treatment, about advice on study,
and so on.

When a test is used descriptively, we do not confine ourselves to one definite
question. Rather, we try to record all important facts so that they will be
available when questions about treatment arise. A description may catalog
a student's interests, describe his personality pattern, or [...]
his knowledge about his major field. The description is multidimensional and
helps us resolve many different questions about how to treat the person.


Evaluation of Treatments 

So far we have considered only decisions about individuals. Tests are 
equally important as an aid in evaluating treatments. When the teacher 
gives an arithmetic test, he is testing his instruction as much as he is testing
the students' effort and ability. If the results are poor, he should probably
alter his method. When more than one instructional method is under con- 
sideration, an experimental comparison can be made; a test shows which 
method gives the best results and should be used hereafter. 

In industry, questions about treatment or management can be decided by 
suitable tests. The effectiveness of training is judged by performance tests.
Supervision and personnel policies can be judged by tests of attitudes and
morale.
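
The experimental comparison of treatments mentioned above reduces, in its simplest form, to comparing the test results of groups taught by different methods. The sketch below is a bare illustration with invented scores; `better_method` and the labels "method A" and "method B" are hypothetical, and the comparison of means deliberately ignores the question of whether the difference is larger than chance variation would produce.

```python
from statistics import mean


def better_method(scores_by_method):
    """Return the label of the method whose group earned the highest mean score."""
    return max(scores_by_method, key=lambda label: mean(scores_by_method[label]))


# Invented arithmetic-test scores for two groups taught by different methods.
arithmetic_scores = {
    "method A": [12, 15, 11, 14],
    "method B": [16, 15, 17, 14],
}

if __name__ == "__main__":
    for label, scores in arithmetic_scores.items():
        print(label, mean(scores))
    print("higher mean:", better_method(arithmetic_scores))
```

A real evaluation would also require that the groups be comparable at the start and that the same test be given to both under the same conditions.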


Verification of Scientific Hypotheses 


The functions discussed above illustrate the usefulness of tests in making 
decisions of immediate practical importance. Tests are also used extensively 
to measure outcomes of scientific experiments, as was illustrated in our ear- 
lier discussion of the measurement of flash perception. The experimenter is 
not making decisions about particular individuals. He is trying to decide 
whether to accept or reject a particular hypothesis (such as, “The change of 
perceptual span with change in illumination is greater when a subject is un- 
der stress”). Tests provide a more objective and dependable basis for com- 
parisons than do rough impressions. 

Sometimes the investigator uses tests published for practical purposes, but 
a test tailor-made to fit the experiment will often work better. In one study, 
for example, the experimenter played phonograph recordings of words
backwards, in order to study how people learn to recognize strange stimuli
(Lewis, 1946). Such a task, just because it is novel, makes a very good
experimental test.


3. Show that a reading test might sometimes be used by college counselors or
administrators for each of the four types of decision listed above.

4. Classify each of the following according to the type of decision represented:

a. A foundling home measures intelligence of a child and uses this as a basis
for deciding which home to place the child in.

b. An instructor rides with a pilot at the end of his training, and fills out a
checklist to show which maneuvers he performs correctly.

c. A psychologist compares the average intelligence of only children with that
of children from larger families of similar social background.

d. All applicants for a driver's license are tested.

e. A test is given in a junior high school for the purpose of identifying
adolescents likely to become delinquent.

f. A university class is divided in two parts, one of which sees the lectures and
demonstrations by television, while the other hears and sees the instructor
directly. Both groups are given the same examination.

5. Education and psychotherapy are both learning experiences, yet tests are used
much more often for routine evaluation in school than in therapy. What reasons
can you suggest?

6. Describe one circumstance where tests might be used descriptively by:

a. An employment manager.

b. A social worker dealing with children.

c. A teacher of typewriting.

7. When tests are used to obtain a description, it can be said that a classification
decision is being made. Explain.


WHAT IS A TEST? 


The layman is likely to think of a test as a series of questions requiring a
written or oral answer. Psychological tests are, however, extremely varied, and
the variety is steadily growing. Perhaps the best definition to cover the range
of tests described in this book is as follows: a test is a systematic procedure
for comparing the behavior of two or more persons. We shall not give atten- 
tion to unsystematic, spur-of-the-moment procedures for sizing up a person 
—casual conversation, for example. 

We shall examine a large number of principles regarding tests, and a large
number of criteria for deciding whether a test is satisfactory. Perhaps we
should define test so as to include all the procedures to which these criteria
and principles apply. If we did this, however, we would have to extend the
definition to cover measures of animal behavior and measures of
nonbehavioral characteristics. For example, to determine how atomic radiation
affects behavior of animals, it is necessary to measure their activity before
and after, and the procedure has to satisfy the same logical requirements as
does any test of human behavior. In one study (Isaac and Ruch, 1956), the
investigators believed that spontaneous movement of monkeys would be
affected by radiation. To measure this effect, they tried four techniques:
rating by an observer, recording from a photocell pointed across the cage,
and two methods of recording the movements of the cage floor, which was
suspended so that it vibrated when the animal moved. Determining the
best technique is just like choosing among educational and clinical tests;
the experimenters had to apply the very indices of reliability and test
intercorrelation which we shall study in later chapters. Thus, while this book is
most concerned with tests used to study differences between people, much
of the material is significant for the animal experimenter, for the sociologist
comparing communities, or for any other behavioral scientist.

Our definition includes measurements using apparatus, laboratory procedures
for observation of social responses, questionnaires for obtaining reports
on personality, and systematic records collected on a production line.
The reader is warned, however, that many definitions
are in current use, varying with the writer's purpose. Some writers restrict
the word test to measuring instruments, but we shall not. A true measuring
instrument is used to assign to every person a number which locates him
on a scale of equal units, as we do when we report height in inches. Not only
do psychological tests give less perfect measurements, in this sense, than do 
instruments used in other sciences, but many useful devices do not “measure” 
at all. In particular, some personality tests yield a verbal description instead 
of summing up the person by means of scores. 


Standardization 


A distinction between standardized and unstandardized procedures grew 
up in the early days of testing. Every laboratory in those days had its own 
method of measuring memory span, reaction time, and so on, and it was dif- 
ficult to compare results from different laboratories. It was likewise difficult 
for school officials to answer such practical questions as whether pupils were 
learning to spell as well as could be expected, when every teacher used a 
different test. Standardized tests were designed to overcome these prob- 
lems. A standardized test is one in which the procedure, apparatus, and scor- 
ing have been fixed so that precisely the same test can be given at different 
times and places. 

Some tests are provided with tables of norms stating what scores are usu- 
ally earned by representative subjects. Tests having such norms are some- 
times called "standardized tests,” and the process of gathering norm data is 
called "standardization." We are not using the word standardized in that 
sense, because we wish to emphasize standardization of procedure. A test 
may have a table of norms even though its procedures are not clearly speci- 
fied, and a test with well-standardized procedures may not have norms. 
Obviously, collecting norms is not profitable until procedures are well stand- 
ardized. 

The first major step toward standardization of psychological testing came 
in 1905, when a committee of the American Psychological Association de- 
fined procedures (e.g., for testing memory) which could be followed in all 
laboratories. Today, most of the published tests with which American ap- 
plied psychologists and teachers operate are carefully standardized. In per- 
sonality assessment, however, a number of quite unstandardized procedures 
are in general use. 

Standardization has a place in all research. In experimental psychology,
standardization is not yet as well accepted as in testing, but the need for 
standardized procedures is much the same. These remarks of Underwood 
and Richardson (1956, p. 84) regarding concept-formation experiments give 
arguments for standardization which apply equally well to tests: 


. . . tasks or materials which have been used are quite diverse in nature.
With few exceptions (e.g., Weigl-type card sorting) no systematic
series of experiments has been built around a single task. While this lack
of task standardization attests to the ingenuity of individual workers in
constructing new materials, the situation may not be entirely satisfactory
for efficient development of laws and theories. In the more highly
developed areas in psychology only a few basic tasks, procedures, or
materials have been used. Thus, classical conditioning, the Skinner box,
nonsense syllables, the pursuit rotor (to mention a few) all have had
widespread use. While some may justifiably raise questions concerning
generality of findings based on such a limited number of procedures
and tasks, it cannot be doubted that interlaboratory communication and
continuity is greatly facilitated by the use of common basic tasks and
procedures.


Tests vary in the completeness with which they are standardized. Printing
the questions and mass-producing the equipment assures uniformity in those
respects, but the directions to the subject are not always worked out in
complete detail. Every condition which affects performance must be specified if
the test is to be regarded as truly standardized. Thus for a test of
color-matching ability, one needs to use uniform color specimens, to follow
uniform directions for administration and scoring, and also to use precisely
the right amount and kind of illumination. If standardization of the test
were fully effective, a man would earn very nearly the same score no matter
who tested him or where. There are, however, many difficulties in completely
standardizing the tester's procedure and the subject's attitude, some of which
will be discussed in Chapter 8.


Objectivity


Tests vary in their degree of objectivity. A fully objective test is one in
which every observer or judge seeing a performance arrives at precisely the
same report. To do this, he must pay attention to the same aspects of the
performance, record his observations to eliminate errors of recall, and score
the record by the same rules. The objectivity of the procedure may be
judged by the degree of agreement between the final scores assigned by two
independent observers. The more subjective the observation and evaluation,
the less the two judges agree.

Tests in which the subject selects the best of several alternative answers
(e.g., true-false, multiple-choice) are referred to as "objective tests,"
because all scorers can apply a scoring key and agree perfectly on the result.
In contrast, an ordinary essay test allows room for great disagreement among
scorers. By careful instructions to the observer or scorer, free-response tests
and observations can be made fairly objective.
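
The idea of judging objectivity by the agreement between two independent
observers can be illustrated with a short computation. The sketch below is
only one possible way of expressing "degree of agreement"; the two indices
(proportion of identical scores, and the product-moment correlation between
the two sets of scores) and all of the score lists are assumptions chosen for
illustration, not procedures or data taken from the text.

```python
from math import sqrt


def exact_agreement(scores_a, scores_b):
    """Proportion of performances given identical scores by two observers."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def correlation(scores_a, scores_b):
    """Product-moment correlation between two observers' scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists of at least two scores")
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when one observer's scores do not vary")
    return cov / sqrt(var_a * var_b)


# Hypothetical data: five papers scored by two independent scorers.
# With a scoring key, both scorers reach the same result on every paper.
key_scorer_1 = [8, 6, 9, 4, 7]
key_scorer_2 = [8, 6, 9, 4, 7]
# On an essay test the same two scorers disagree on most papers.
essay_scorer_1 = [8, 6, 9, 4, 7]
essay_scorer_2 = [6, 7, 9, 5, 4]

print(exact_agreement(key_scorer_1, key_scorer_2))
print(exact_agreement(essay_scorer_1, essay_scorer_2))
print(correlation(essay_scorer_1, essay_scorer_2))
```

In this made-up example the keyed test shows complete agreement, while the
essay scores agree exactly on only one paper in five and correlate only
moderately.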


8. Judge each of these statements true or false and defend your answer:

a. Batting averages are objectively determined.

b. The 220-yard low hurdle race is a standardized test.

c. A teacher has each member of the class read the same article in a current 
magazine. Time is called at the end of three minutes, and each pupil marks 
the place where he is reading. He then counts the number of words read 
and computes his reading rate in words-per-minute. This score is compared 
with a table of average reading speeds for typical magazine articles. This 
test is highly objective. 

d. The test described in c is standardized. 

9. Psychological tests often start from very crude procedures. Psychologist X thinks 
that he obtains useful information by laying a sheet of paper on the table at 
arm's length from his subject and asking him to touch with his pencil exactly 
in the center of a circle printed on the paper. The subject is told to withdraw 
his hand and repeat the movement, as rapidly and accurately as possible, until 
he is told to stop. Psychologist X gives the man a mark from 1 to 10 on each 
of the following qualities: speed, carefulness, and persistence. 

a. What changes would improve the objectivity of the test? 

b. What aspects of the procedure would need to be taken into account in 
standardizing the test? 

10. Industrial morale surveys often use questions made up by the plant personnel 
office or its consultants. What advantages and disadvantages would there be 
in using the same standardized questions in many different plants? 

11. The Kohs Block Design Test (see Figure 5, p. 42) is one of the most popular 
testing procedures. The subject is required to construct a pattern from colored 
blocks to match a printed sample. The test is chiefly used in child guidance,
clinical diagnosis, and measurement of intelligence of persons who do poorly 
on verbal tests. It is also used for research on frustration and on cultural differ- 
ences. At least twenty versions of the test (different items, different scoring 
rules, etc.) are used in different clinics and different countries. What are the 
possible advantages and disadvantages of this diversity? 

12. The Kohs test was first published as a long series of carefully chosen items. 
Why do you think so many different versions now exist in different countries, 
even though the test is used for the same purpose in these places? 


Psychometric and Impressionistic Testing 


There are two philosophies of testing, growing from different historical 
roots and fostering different types of test procedure and interpretation; both 
are mingled in contemporary practice. While we cannot discuss these dif- 
ferent approaches exhaustively, especially in this introductory chapter, we 
can survey the main characteristics of each. 

Psychometric testing obtains numerical estimates of single aspects of per- 
formance. Its ideal is expressed in the famous dicta of E. L. Thorndike that 
“If a thing exists, it exists in some amount,” and “If it exists in some amount, 
it can be measured.” One can observe in this statement a hidden assumption 
that the psychologist is concerned with “things,” i.e., with distinct elements 
or traits which have a real existence. All people are considered to possess 
the same traits (e.g., intelligence, or mechanical experience), but in
different amounts. This view of psychological investigation takes its cue from
physical science, which identifies common aspects of dissimilar objects and 
describes any object by numbers representing such abstract dimensions as 
weight, volume, and intensity of energy of a certain wave length. 

The second approach leads to a comprehensive descriptive picture of the 
individual. We shall refer to this style of investigation as impressionistic. 
Impressionistic psychologists think that understanding another person re- 
quires a sensitive observer who looks for significant cues by any available 
means and integrates them into a total impression. Studying one trait or
element at a time is, in their view, no substitute for considering the person
as a whole. The impressionist is not satisfied with knowing "how much" of some
ability the person has; he asks how the subject expresses his ability, what
kinds of errors he makes, and why (see Barron, 1957).

To evaluate a subject's background, for example, a psychometric tester 
would have him respond to a biographical checklist covering experiences 
which many people have and which are likely to be important in their de- 
velopment. (For example: “Were you a Boy Scout patrol leader?”) He 
would score responses objectively by counting the number of items checked 
in such categories as "Interest in sports" and "Leadership experience." The
impressionist, on the other hand, would ask for an autobiographical essay,
perhaps setting no more definite task than "Please write your life story on
these pages." From the response, he could see what the subject considers
important about himself, what emotional tone he uses to describe his past, and
what unique experiences he has had, experiences the checklist would not
cover. The free response may give little information on important areas
covered thoroughly by the checklist, but it covers matters the checklist ignores.
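
The objective scoring of a biographical checklist described above amounts to
counting checked items within each category. A minimal sketch follows; apart
from the Boy Scout item and the two category names, which come from the text,
every item and the assignment of items to categories are hypothetical.

```python
# Scoring key: each checklist item belongs to one category.
# Only the first item is quoted from the text; the rest are invented examples.
CHECKLIST_KEY = {
    "Were you a Boy Scout patrol leader?": "Leadership experience",
    "Were you captain of a school team?": "Leadership experience",
    "Did you play on a school team?": "Interest in sports",
    "Do you attend sports events regularly?": "Interest in sports",
}


def score_checklist(checked_items, key=CHECKLIST_KEY):
    """Count the number of checked items falling in each category."""
    scores = {category: 0 for category in key.values()}
    for item in checked_items:
        if item in key:  # anything not on the printed checklist is ignored
            scores[key[item]] += 1
    return scores


responses = [
    "Were you a Boy Scout patrol leader?",
    "Did you play on a school team?",
    "Do you attend sports events regularly?",
    "I once trained dogs for show.",  # not on the checklist, so not counted
]
print(score_checklist(responses))
```

Any two scorers applying this key reach the same counts, which is the sense
in which the checklist is objective; the last response shows the kind of
unique experience the checklist does not cover.
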

Each approach has merit, and each has its special limitations. Both have
contributed to the development of present practice, and neither style can be
adopted to the exclusion of the other. The measurer must fall back upon
judgment whenever he applies information from scores in teaching, therapy,
or supervision of employees, and the portraitist cannot ignore the accurate
facts psychometric instruments provide. There are several differences between
the psychometric and impressionistic schools; a particular testing procedure
may follow one school on one point and another on the next. The
styles differ with respect to definiteness of tasks employed, control of
response, objective recording of basic data, formal numerical scoring and
numerical combination of data to reach decisions, and critical validation of
interpretations.

Definiteness of Task. The test designer decides how definitely the task
is to be explained to the subject. In some tests, such as the biographical essay
mentioned above, the subject is free to employ any style and any content he
chooses. On the other hand, a questionnaire in which the subject is to check
each activity he has engaged in during the past five years leaves little or no
room for individual interpretation.

A test is said to be structured when all subjects interpret the task in the
same way. The more latitude allowed, the less structured the test is. Of spe- 
cial interest are projective tests, which ask the subject to interpret a stimu- 
lus that has no obvious meaning. For instance, he may be shown an inkblot 
and told to report what it looks like to him. If he asks how many ideas to re- 
port, whether to use the same portion of the blot in two ideas, or any other 
such question, he is told, "That's up to you." 

Structuring the task controls the performance so that all subjects are 
judged on very much the same basis. It therefore permits a definite answer 
to a question formulated in advance (e.g., how much experience with small 
boats has the subject had?). The less structured technique allows greater 
variation in responses and in that sense reveals more individualized re- 
sponse patterns. (The subject's essay may, for example, give information on 
some unusual interest, such as training dogs for show, but may tell nothing 
about boating experience.) 

Recognition vs. Free Response. Most tests can be designed either in a
free-response or in a recognition form, which allows greater control of
responses and makes scoring less impressionistic. In a mental test,
series-completion items (75869 . . .) and verbal analogies (wolf is to cub as
cat is to ____) may be left in free-response form, or the subject may be
offered alternative answers from which to choose.

The psychometric tester generally prefers the recognition test because it 
can be more objectively scored, does not depend on fluency or expressive 
skill, and is less subject to misinterpretation of questions than the free-re- 
sponse form. Many testers, however, prefer the free-response form. The 
most important reason is that the free response permits observations which 
illuminate the scored aspect of performance. If a student writes out a
long-division problem, for example, the tester can judge his neatness and his
organization of work, and perhaps can base diagnostic conclusions on the
errors he commits.

Product vs. Process. A principal difference between psychometric and
impressionistic testing is that the former concerns itself with the tangible
product of the performance: the answer given, the block tower constructed,
or the essay written. When a psychometric tester does pay attention to the
process of performance, he arms himself with a record sheet for tabulating
what the subject does. The impressionistic tester, however, watches the
subject at work in order to form a general opinion; this general impression is
indeed the basic datum with which the psychologist works. In describing their
military testing during World War II, "German psychologists [who carry the
impressionistic style to extremes] stated repeatedly that observations of the
candidate's behavior during a test were more important than the actual score
which he earned. . . . One man . . . said that the chief fault of
inexperienced military psychologists was that they attached too much weight to
objective scores and did not pay enough attention to the formation of an
intuitive impression from observation of the candidate's reactions and
expressions. Individual examiners were permitted and often encouraged to
vary testing procedures and to emphasize their favorite tests" (Fitts, 1946).

Analysis of Results. It follows that formal scoring plays a large part in the
psychometric test and a very minor part in the work of the impressionistic
tester. American devotion to the numerical score sometimes goes to such
extremes that a tester reports nothing about a child but the IQ calculated for
him, discarding all the other information obtained in an hour of close
observation. The thoroughly impressionistic tester may in his turn translate a
test performance into a character description without ever counting up a score.
Preferably, in individual testing, both scores and descriptive information are
taken into account.

When a decision is to be made, one can apply some formal rule to the
various facts or can combine them impressionistically. For example, a teacher
may assign a course grade by strictly averaging the tests, or may form an
overall impression that this student is "doing B work even if he did slump
at the end" and that one is "not really as good as his tests suggest." The
psychometric tester tends to prefer the impersonal procedure, while the
impressionist thinks an informal method is more flexible and realistic.
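
The "formal rule" in the teacher's case can be written out explicitly. The
sketch below strictly averages the tests and converts the mean to a letter
grade; the cutoff points and the scores are hypothetical, chosen only to show
how an impersonal rule can part company with the teacher's overall impression.

```python
# Hypothetical cutoffs: minimum mean score required for each letter grade.
DEFAULT_CUTOFFS = ((90, "A"), (80, "B"), (70, "C"), (60, "D"))


def grade_by_average(test_scores, cutoffs=DEFAULT_CUTOFFS):
    """Assign a course grade by strictly averaging the test scores."""
    if not test_scores:
        raise ValueError("at least one test score is required")
    mean = sum(test_scores) / len(test_scores)
    for minimum, letter in cutoffs:  # cutoffs are listed from highest to lowest
        if mean >= minimum:
            return letter
    return "F"


# A student who did B-level work on three tests and slumped on the last one.
slumping_student = [88, 86, 85, 58]
print(grade_by_average(slumping_student))
```

Under these assumed cutoffs the rule gives this student a C, whereas a teacher
combining the same facts impressionistically might judge that he is "doing B
work even if he did slump at the end."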

The psychometric tester's insistence on numerical scores influences his
choice of tests. Some testers bombard the subject with one test after another,
seeming to have almost a mystical faith that the accumulation of numbers
will provide all the information needed to solve his problems. In this
concentration on measurable variables the tester may ignore equally pertinent
aspects of the individual for which no scorable instruments have been
developed. It is easy, in child guidance, to obtain measures of ability, and
fairly adequate instruments exist for obtaining an "emotional adjustment"
score. These scores, however, tell only a small part of the story, and the
psychologist should certainly go on to investigate the child's image of his
mother, his father, and his teacher, and what activities in his life give him
the greatest satisfaction, even if none of these questions can be answered by
a number on a scale, or taken into a statistical formula.

Emphasis on Critical Validation. Finally, we come to the question of critical 
validation. Psychometric testers are taught to distrust judgments based on 
tests and observations. Ideally, a psychometric tester accompanies every
numerical score with a warning regarding the error of measurement, and
every prediction with an index showing the probability of its coming true.
The impressionist is less likely to carry out formal validation studies, often
being satisfied to compare impressions based on one procedure with
impressions gained from another. Validation of qualitative interpretations and
"portraits" is much more difficult than validation of scores and requires a 
greater readiness for self-criticism on the part of the psychologist. 

The most critical issue, indeed, between psychometric and impressionistic 
testing is that of confidence in the psychologist. Those who develop and 
advocate rigorous psychometric procedures regard the tester as a source of 
bias tending to obscure the truth. Those who prefer less structured proce- 
dures regard the observer as a sensitive and even indispensable instrument. 
The impressionist does not deny the danger of bias and random error in 
judgment. He, however, fears that narrowing one's focus to what can be rep- 
resented in a numerical score on a standard procedure throws away most of 
the psychologically important information. The gains from intuitive observa- 
tion and interpretation, he believes, more than offset the errors it introduces. 

Most testers occupy an intermediate position—intermediate between ob- 
session with scores and unrestrained use of intuition. Formal, strictly objec- 
tive procedures are normally combined in some manner with judgment, 
everywhere save in mass classification programs such as military processing.

The impressionistic style assigns great responsibility to the test inter- 
preter. He must be an artist, sensitive to observe and skillful to convey his 
impressions. Some psychologists are presumably much better judges of per- 
sonality than others. The psychometric method seeks procedures which 
everyone can use equally well. The objective test is a camera pointed in a
fixed direction; every competent photographer should get the same picture
with it. Thus psychometric testing aims to reduce analysis of individual
differences to a routine technical procedure. To the extent that it succeeds,
it reduces the need for an authoritative, "wise" professional psychologist. A
similar conflict between the technical and the artistic ideal is found in
medicine. Laboratory tests assume more and more of the burden of medical
diagnosis, yet doctors have great respect for the legendary genius who
diagnoses unerringly the malady overlooked by the tests.


13. "Psychometric testing trusts the judgment of the test constructor, where it is
unwilling to trust the tester." Is this a defensible statement?
14. Distinguish between structured and standardized. 


15. In what respects are the following procedures unstructured? 


a. In the Ayres handwriting test, pupils are told to write the Gettysburg Ad- 
dress neatly, doing as much as they can in a fixed time. 


b. In the Draw-a-Man test of mental development, the child is told to "draw
the best man you can."

c. In a recorded pitch-discrimination test, the subject hears two tones and
responds H, L, or N, according as the second tone appears higher than, lower
than, or no different from the first.

16. What are the advantages and disadvantages of the biographical checklist as
compared to the essay?

17. Is the issue between psychometric and impressionistic testers one that can be
settled by suitable factual research? 


CLASSIFICATION OF TESTS 


Tests might be classified in many ways, according to form, purpose, content,
and other characteristics. We shall place tests in two classes, the first being
those which seek to measure the maximum performance of the subject. We
use these when we wish to know how well the person can perform at his
best; they may be referred to as tests of ability. The second category includes
those tests which seek to determine his typical performance, i.e., what he
is likely to do in a given situation or in a broad class of situations. Tests of
personality, habits, interests, and character fall in this category, because
characterizations like "shy," "interested in art," and "anxious when in
disagreement with a superior" describe the individual's typical behavior.


Tests of Ability 


The distinguishing feature of a test of ability is that the subject is
encouraged to earn the best score he can. An ability is a response subject to
voluntary control (McClelland et al., 1958, p. 206). Naturally, the adequacy
of the test depends upon the degree to which the person is motivated, i.e.,
upon his willingness to demonstrate his ability. The goal of the tester is to
bring out the person's best possible performance.

FIG. 1. Two of the Porteus mazes (Adult I and Year V). The subject is required
to trace the correct path through a maze; when he gets one maze correct he
goes on to a more difficult one. (Copyright 1933, The Psychological ...)

Notice that we define ability tests in terms of what the tester is trying to 
learn rather than by describing the test itself. A test intended to reveal maxi- 
mum performance sometimes fails to do so (for example, when the subject 
becomes too tense to perform well). Moreover, the same testing procedure 
can be used either to measure ability or to study typical performance. For 
example, although the Porteus maze (Figure 1) can be scored solely in 
terms of speed and correctness, it also permits the tester to observe how 
much foresight and planning the subject uses. Since any test performance 
depends on both ability and personality, our classification is somewhat arbi- 
trary. 

Some ability tests measure performance on familiar tasks: for example, a
road test for a driver's license. Others require the person to do something
completely unfamiliar. The Complex Coordination Test requires a person
who has never flown a plane to operate a "stick" and "rudder bar" just
as if he were flying. Flashing lights signal for certain movements. If he
can follow the directions and make the necessary coordinations he gets
a high score. This task reproduces one aspect of the flyer's job; other
things being equal, a person who is superior on this test will be a
superior pilot.

FIG. 2. The Complex Coordination Test.

Tests measuring maximum performance are referred to as mental
tests, intelligence tests, etc. We shall not define these terms formally;
indeed, most of the terms have no well-established definition. One large
group of tests we shall refer to as measures of general mental ability. They
seek to measure those mental abilities which are valuable in almost any type
of thinking or learning. Tests of this sort are often called "intelligence
tests," but that name leads to controversy because "intelligence" has so many
meanings. General abilities may be contrasted with the more specialized
abilities which are of value only in a limited range of tasks. Among the
specialized abilities are mechanical comprehension, sense of pitch, and finger
dexterity. There is no widely used name for tests of this sort; we shall refer
to them as
measures of special abilities. While a test for a single specialized ability may 
be used by itself, it is more common to test several such abilities at once so 
as to study the person's ability profile. 

A proficiency test measures ability to perform some task which is signifi- 
cant in its own right: reading French, playing a piano, trouble-shooting an 
airplane engine. Since one of the principal uses of such a test is to evaluate 
performance of persons who have been given training in the task, these tests 
are often referred to as achievement tests.

An aptitude test is one used to predict success in some occupation or
training course; there are tests of engineering aptitude, musical aptitude,
aptitude for algebra, and so on. In form, these tests are not distinctly
different from other types. An engineering aptitude test may include sections
on general mental ability, mechanical and spatial reasoning (special
abilities), and proficiency in mathematics. The test is referred to as an
achievement test when it is used primarily to examine the person's success in
past study, and as an aptitude test when it is used to forecast his success in
some future course or assignment.


Tests of Typical Performance 


Tests of typical performance are used to investigate not what the person 
can do but what he does. There is little value in determining how courteous 
a girl applying for store employment can be when she tries; almost anyone
of normal upbringing has the ability to be polite. The test of a suitable
employee is whether she maintains that courtesy in her daily work, even when
she is not "on her best behavior." To take another example, any inspector
with proper vision and training should be able to detect defective parts. A
test which determines how well he spots defects when trying especially hard
would measure vision rather than carefulness. The chief difference between
the good and the poor inspector is that the latter permits himself to be
distracted and careless in run-of-the-mill duty.

For cheerfulness, honesty, open-mindedness, and many other aspects of
behavior, a test of ability has almost no practical value. Most people can
produce a show of the behavior when it is demanded of them. But those
who act cheerfully, honestly, or impartially when they know they are being
tested may not do so in other situations. Typical performance is important
even when we are concerned with aptitude for success. If we are hiring an
executive whose past success guarantees his ability, we also wish to know
how he usually operates. Does he supervise closely, down to the last detail?
Or does he outline a general task and turn his subordinates loose? Is he
equally concerned with production, human problems, and finances? Does he
prefer long-range planning or quick adaptation? Knowing his pattern is
necessary to place him properly in the organization. 

In tests of ability a high score is desirable, but in most tests of typical
performance no particular response can be singled out as "good." For example,
there is nothing good or bad about interest in engineering. One who has 
this interest can use it, but one who does not finds other worth-while ac- 
tivities. Likewise, people show wide variation in dominance-submission in 
social relations. We cannot say that any certain degree of dominance is best, 
since our world has places for persons of all types. 

The person's characteristic behavior is our best clue to his personality. 
Habits have predictive value in themselves; what a person does once he is 
likely to do again. Most psychologists would object, however, to assuming 
that a person's observable habits are his personality. New situations
continually arise, and a description of his customary behavior does not
directly indicate what he will do in a new situation. A boy may have a
reputation as a woman-hater, but some girl will come along who arouses a quite
different response. A clinician who establishes warm relations with most
clients will encounter some who arouse only hostile feelings in him. Because
we do not wish
to regard these exceptional reactions as capricious and unexplainable, we 
interpret reactions in various situations as reflections of a more basic and 
consistent "personality structure." This structure has to be inferred from be- 
havior. The psychologist hopes that when he understands the structure he 
will be able to predict the person's responses even to new situations. 

Testing of typical performance is difficult. It has been accomplished, with 
greater or less success, in a variety of ways. These methods may be divided 
into behavior observations and self-report devices. 

Behavior Observations. Behavior observations are attempts to study the
subject when he is "acting naturally." Observations are made both in
standardized test situations and in unstandardized or "natural" conditions.


The standardized observation requires that each subject be placed in es- 
sentially the same situation. Personality may be observed during a mental 
test, during a group discussion, or while the subject is walking a rail blind- 
folded. Special tasks are often devised which give an especially good oppor- 
tunity for observation. These tasks may be referred to as performance tests 
of personality. 

The standardized observation permits relatively exact comparison of per- 
sons who are not normally seen in similar circumstances. Moreover, it reveals 
characteristics which could be seen only occasionally in everyday life. Such 
procedures have been used, for example, to observe typical reactions to frus- 
tration. The person commences a task and is prevented in some way from 
attaining his goal. The way he reacts gives insight into his emotional control. 
In one famous study, preschool children were given the opportunity to play
with ordinary, reasonably interesting toys. Then they were allowed into an
adjoining room with extremely attractive toys. After a period of play in this 
room, they were herded back into the first room, and a wire screen was 
placed between them and the attractive toys. 

As the artist's representation of this experiment shows (Figure 3), the
children reacted in many ways: pounding on the fence, regressing to simple
play with rocks, trying to pry under the fence, or going off to take a
pretended nap. The observers recorded the children's behavior, finding that
after frustration their games were less mature and less constructive than
before.

FIG. 3. Varied behavior among children subjected to experimental frustration.
(After a study by Barker and others, 1941; drawing reproduced from Morgan,
1942, p. 249.)

If an observation is to bring to light typical behavior, the subject must not
know what characteristic is being observed. The observer may be concealed,
or the subject may be led to believe that he is being tested on one behavior
when something else is being observed. Thus when reaction to frustration
is being studied, the subject may be told that his mental ability is being
tested. His responses when he is frustrated by difficult questions are usually
genuine and little disguised.

Data on typical behavior may also be gathered by observing samples of
the person's ordinary daily activities, "in the field," as it were. Children on
a playground reveal a good deal about their habits and personality; so do
noncoms leading platoons, and workers in the office. Field observations may
use elaborately standardized recording procedures, even sound-motion
pictures, or may, on the contrary, consist merely of an impressionistic
ment. The baseball batting average is a summary of systematically recorded 
field observations. The industrial supervisor's merit ratings are also based 
on observation, but the judgments are almost completely unsystematic. 

Self-Report Devices. The subject has had much opportunity to observe him- 
self. If he is willing, he can give a helpful report of his own typical behavior. 
Questionnaires are used to obtain such reports. The crucial problem in self- 
report, if it is to be interpreted as a picture of typical behavior, is honesty. If 
the person tries to give the best possible picture of himself instead of a true 
description, the test will fail of its purpose. Even when he tries to be truth- 
ful, we cannot hope that he is a really detached and impartial observer of 
himself. His report is certain to be distorted to some degree. 

Most self-report inventories offer a fairly comprehensive picture of per- 
sonality. Some of them, however, are specialized in their coverage. There are 
study-habit inventories, interest inventories, social attitude inventories, and 
so on. Other tests are designated as “adjustment inventories,” “character 
tests,” etc.; such a name suggests the way in which the score is to be inter- 
preted but does not identify a distinctive form of test. 

It is generally agreed that personality questionnaires should not use the
word test in their titles (Technical Recommendations, 1954, p. 10). If an
instrument is marketed under the title "The Jones Dominance-Submission
Test," employers, teachers, or others with limited psychological training may
think that a person's dominance is being directly measured. If the instrument
merely asks the subject a series of questions about himself, he can describe
himself in any way he likes. A title such as "The Jones Dominance
Questionnaire" or "The Jones Dominance Inventory" is less likely to give an
impression of trustworthiness than "The Jones Dominance Test." It is desirable
to use some term such as questionnaire whenever the word test might be
misinterpreted.


18. Classify each of the following procedures as a test of ability, a
self-report test, observation in a standardized situation, or observation in
an unstandardized situation:

a. An interviewer from the Gallup poll asks a citizen how he will vote in a
coming election.

b. A television producer wishes to know what program features appeal to
different types of listeners. He presents a show to a small audience, who
press signal buttons to indicate whether they enjoy or dislike what they
are seeing at each moment.

c. A test of "vocational aptitude" asks the subject how well he likes such
activities as selling, woodworking, and chess.

d. A spelling test is given to applicants for a clerical job.

e. Inspectors in plain clothes ride buses to determine whether operators are
obeying the company rules.

f. During an intelligence test, the examiner watches for evidence of
self-confidence or its absence.

g. In a test of “application of principles in social studies," students are told of 
a conflict about admitting Negroes to a housing project. They are asked 
what the city council should do and to give reasons to support the choice. 

h. An inspector in a stocking factory is supposed to detect all stockings with 
knitting faults. To check her efficiency, at certain times a number of faulty 
stockings which have been marked with fluorescent dye are mixed into the 
batch for inspection. The dye is invisible to the worker, but by turning an 
ultraviolet lamp onto the stockings after inspection the supervisor can 
readily locate the faulty stockings which the inspector missed. 


Procedural Terms


There are a number of miscellaneous terms designating tests according to
their procedure. The meaning of such terms as pencil-and-paper test,
apparatus test, oral test, and so on should be obvious. Although all tests
require performance, the term performance test is usually applied to tests
requiring a nonverbal response. Among the performance tests which have been
used for various purposes are repairing a piece of electronic apparatus,
drawing a picture of a man, stringing beads, and "inventing" a hat rack when
given two long sticks and a C-clamp.

Group tests differ from individual tests in that the former permit many
persons to be tested at once. Group tests can be given to a single individual
if that is desirable. Many individual tests require careful oral questioning
or observation of reactions. Some individual tests can be modified and
simplified to permit group administration. An example is the Rorschach test of
personality. In the individual form of that test, a subject looks at a card
bearing an inkblot and tells what he thinks the blot looks like. He is
questioned about each response until the tester is sure just what the subject
sees. In the group form, the blots are projected onto a screen. Subjects write
their responses, and individual questioning is omitted.


Another meaning for group test has developed in recent years. The term is now
also applied to a procedure in which several persons may be asked to work
together to solve a problem while each person is observed.


19. Classify each of the following tests, using as many of the descriptive
terms discussed in the text as are clearly applicable.

a. The Study of Values consists of printed questions, such as

In your opinion, can a man who works in business all the week best spend
Sunday in

a. trying to educate himself by reading serious books
b. trying to win at golf, or racing
c. going to an orchestral concert
d. hearing a really good sermon


The subject answers each question by checking whichever answer he pre- 
fers. Answers are scored by a numerical key to determine how important 
"aesthetic," "religious," and other values are for him. 

b. In the Stenquist Mechanical Aptitude Test, the subject marks illustrations of 
tools and other objects to show which go together (e.g., hammer and 
anvil). 

c. A Picture Arrangement Test item presents a set of four pictures which, ar- 
ranged in the correct order, tell a story in the manner of a cartoon strip 
(see Figure 33). Each picture is on a separate card. The cards are presented 
in a random arrangement and the subject arranges them to make an in- 
telligible story. 

d. In a finger dexterity test the subject mounts washers on rivets and places 
each one in a hole on a special board, working as rapidly as possible. 

20. Classify the procedures used by Miss Kimball, the school psychologist described 
in Chapter 1, according to the terms used in this chapter. 


Suggested Readings 


Baldwin, Alfred L. The role of an "ability" construct in a theory of behavior. In 
David C. McClelland & others, Talent and society. New York: Van Nostrand, 
1958. Pp. 195-233. 

Baldwin discusses the nature of ability and the theoretical requirements of 
ability tests. His argument that only voluntary behavior shows ability, as
distinguished from habit, amplifies our distinction between maximum
performance and typical performance.

Bingham, Walter V. On getting rattled. Personnel Psychol., 1950, 3, 105-111. 
This article describes some apparatus tests for measuring coördination which,
with slight modification, can be used to observe temperament. 

Melton, Arthur W. (ed.). Problems and techniques of mass testing with apparatus.
In Apparatus Tests. Washington: Government Printing Office, 1947. Pp. 22-53.

The aptitude testing program discussed in these reports was the most
elaborate one ever conducted. This chapter shows what had to be taken
into account in standardizing procedures so that men tested in California
could be compared precisely with men tested in Texas.

Munn, Norman L. Intelligence, and The assessment of personality. Psychology:
(3rd ed.) Boston: Houghton Mifflin, 1956. Pp. 48-81, 170-181.
An introductory textbook describes a great variety of prominent tests, 
giving drawings or photographs of most of them. The chapter on intelligence 
also provides considerable information on the nature and growth of intelligence. 


Administering Tests 


SOME tests are sufficiently simple for any intelligent adult to give
successfully; others are so subtle that months of special training are
required before a tester can do a fully effective job. In general, group tests
require less training to administer than individual tests, although there are
some exceptions. If the tester has no responsibility save to read a set of
printed directions, any conscientious, nonthreatening person should be
successful. Where it is necessary to question the subject individually and to
use follow-up questions if the first answer is unclear, great skill and
experience are required.

The tester must take pains to give every subject a chance to exhibit his
ability, and to obtain results comparable to those of other testers. The
importance of rigorous adherence to prescribed testing procedure is especially
obvious in the great competitive testing programs for scholarships and
college admissions. The Scholastic Aptitude Test of the College Entrance
Examination Board, the most prominent example, is given in 1000 American
centers and 40 foreign ones. At 9 A.M. on a particular Saturday in January,
the seal is broken on the test package in each center: in Bronxville and
Berkeley and Bell Buckle, Tennessee, in Beirut and Kubasaki and Kodaikanal.
The completed papers pour into the scoring centers and reports go out to the
colleges. A boy tested in Beirut may be in competition with one in Berkeley
for admission to the same college, and the selection procedure is unfair
unless the two are tested in an identical manner.

To assure fair testing, the tester must become thoroughly familiar with the
test. Even a simple test usually presents one or two stumbling blocks which
can be anticipated if the tester studies the manual in advance.

The tester must maintain an impartial and scientific attitude. Testers are
usually keenly interested in the persons they test, and desire to see them do
well. The beginning tester is tempted to give hints to the subject or to coax
him toward greater effort. It is the duty of the tester to obtain from each
subject the best record he can produce; but he must produce this by his own
efforts, without unfair aid. The tester must learn to suppress not only direct
hints but also those unconscious acts which serve as cues to the subject.

This is especially a problem in individual testing, where each question is
given orally. On a mental-test item where the child is supposed to receive
only one trial, his answer may show that he did not comprehend the question.
The tester will often be tempted to repeat the question "since the child
could certainly have answered correctly if he had understood what was
wanted"; this must not be done, since the test directions permit only one
trial. Adjustments are sometimes warranted, however; for example, the result
might be discarded (rather than scored as wrong) if an outside disturbance
caused the child's failure.

Unintended help can be given by facial expression or words of encouragement.
The person taking a test is always concerned to know how well he is doing, and
watches the examiner for indications of his success. Suppose he is given the
task: "Repeat backward, 2-7-5-1-4." He may begin "4-1-7 . . ."; if the
examiner, on hearing the "7," permits his facial expression to change, the
subject may take the hint and catch his own mistake. The examiner must
maintain a completely unrevealing expression, while at the same time silently
assuring the subject of his interest in what he says.

Maintaining rapport is necessary if the subject is to do well. That is, the
subject must feel that he wants to coöperate with the tester. A teacher who
knows and likes a child, or a counselor who has worked with an adult, can
often secure more spontaneous and representative performance than a stranger
called in to administer tests. Those who are acquainted with the subject,
however, will be less impartial and must be unusually circumspect in following
procedures. No rules can be given for the establishment of rapport, but
the tester who likes people will develop many techniques. The person who
proceeds coldly and "scientifically" to administer the test, without convincing
the subject that he regards him as an important human being, will frequently
find it difficult to maintain coöperation. Poor rapport is evidenced
by inattention during directions, giving up before time is called,
restlessness, or finding fault with the test.

This chapter gives a general introduction to test administration. It cannot, 
of course, make the reader into a skilled tester; that comes only with practice. 
To clarify our discussion in this and the next three chapters, we digress here 
to describe the Bennett Test of Mechanical Comprehension and the Block 
Design test in some detail. These tests are important in themselves, but we
present them here so that we can refer to them to illustrate general principles
of testing. If possible, the reader should take each of these tests himself.

TWO SPECIMEN TESTS


Test of Mechanical Comprehension 


The Test of Mechanical Comprehension (TMC) originated by George K. 
Bennett is one of the most widely used tests in the "special ability" group. 
The first form appeared in 1940. Four forms were published under Bennett's 
name, other versions have been included in military classification batteries 
and vocational aptitude tests, and a recent version is contained in the im- 
portant DAT battery for high-school guidance. A list of the forms and their 
purposes was given in the catalog excerpt in Chapter 1 (p. 14).

The test manual (Bennett, 1947) begins with this description of purpose:* 


The Test of Mechanical Comprehension measures the ability to perceive and
understand the relationship of physical forces and mechanical elements in practical
situations. This type of aptitude is important for a wide variety of jobs and for
engineering and many trade school courses.

Mechanical comprehension may be regarded as one aspect of intelligence, if
intelligence is broadly defined. The person who scores high in this trait tends
to learn readily the principles of operation and repair of complex devices. Like
other aptitude tests, it is influenced by environmental factors, but not to an
extent that introduces important difficulties in interpretation. Formal training
in physics appears to increase the score by not more than 4 points. Care has been
taken to present items in terms of simple, frequently encountered mechanisms that
do not resemble textbook illustrations or require special knowledge.

The test booklet carries instructions to the student and draws his attention
to two specimen items (Figure 4). The manual carries the following directions
to the tester:


This test has no time limit. Ordinarily, a great majority finish in twenty to
twenty-five minutes; little is gained by allowing more than thirty minutes.

After distributing the booklets and answer sheets, say: You have been given a
test booklet containing questions and a separate sheet for your answers. Be sure
to write only on the answer sheet. Make no marks on the booklet. Now look at the
directions printed on the cover of your test booklet while I read them aloud to
you.

Fill in the requested information on your ANSWER SHEET. [E.g., name, age,
date, last grade completed.] . . .

Now line up your answer sheet with the test booklet so that the "Page 1" arrow
on the booklet meets the "Page 1" arrow on the answer sheet. Demonstrate. Now
look at Sample X on this page. It shows two rooms and asks, "Which room has more
of an echo?" Because it has neither rugs nor curtains, there is more of an echo
in room "A," so blacken the space under "A" on your answer sheet. Now look at
Sample Y and answer it yourself. Fill in the space under the correct answer on
your answer sheet. Are there any questions? If the answers on the answer sheet
are not directly opposite the questions, raise your hand.

After Sample Y has been answered, say: On the following pages there are more
pictures and questions. Read each question carefully, look at the picture, and
fill in the space under the best answer on the answer sheet. Make sure that your
marks are heavy and black. Erase completely any answer you wish to change. Be
certain that you use the right column on the answer sheet for each page. The
arrow on the page should meet the arrow on the answer sheet. These arrows are at
a different place on each page to help you.

Now open your booklets and fold back the cover so that only Page 2 shows,
like this. Demonstrate. Then slip your answer sheet under the booklet and line it
up so that the arrows for "Page 2" meet, like this. Demonstrate. When you finish
a page, go right on to the next. Now begin the test. Answer all the questions;
you will probably have plenty of time to finish. If you have any questions, raise
your hand. The examiner should make sure that everyone understands how to use the
answer sheet.

If the answer sheets are to be machine scored, the examiner should include
appropriate directions regarding the special pencils.

[Figure 4 shows two specimen items. Sample X: "Which room has more of an echo?"
Sample Y: "Which weighs more? (If equal, mark C.)"]

FIG. 4. Mechanical Comprehension Test items. (Sample X is from Form AA of the
Bennett test, and Sample Y from Form A of the DAT Mechanical Reasoning Test. Both
items used by permission of The Psychological Corporation. Bennett item copyright
1940, 1941, 1955; DAT item copyright 1947.)

* Directions, norms, and specimen items reproduced by permission. Items copyright
1941, 1947, by The Psychological Corporation.


The answer sheet is similar to that illustrated on page 67. 


Block Design Test 


The history of the Block Design test illustrates the way in which tests
develop. S. C. Kohs was a clinical psychologist who invented the procedure and
made up a set of items (Kohs, 1923). It was only one among a large number
of mental tests invented during the 1920's, when applied psychology first
came into prominence. As schools began to hire psychologists to examine
children, a demand arose for well-standardized collections of tests. A
psychologist acting as editor collected tests by various authors, improved the
directions, materials, and scoring procedures, and applied the whole set to a
large group of typical pupils to obtain standards of comparison. Several such
collections were made, including those of Grace Arthur and Pintner and
Paterson, each being designed to fill slightly different needs. The Block
Design procedure was used in many of these collections, being a good nonverbal
measure of analytic and synthetic reasoning with a wide range of difficulty.
Revision and restandardization has continued down to the present day. Each
modification alters the number of items or the directions, or introduces new
designs. We shall describe the test in the recent WAIS version
(Wechsler, 1955, p. 47).

The test makes use of a set of colored one-inch cubes originally sold for
children's play. The test instructions begin as follows:

Start with Design 1 for all subjects. Take four blocks and say You see these
blocks. They are all alike. On some sides they are all red; on some, all white;
and on some, half red and half white. Turn the blocks to show the different
sides. Then say I am going to put them together to make a design. Watch me.
Arrange the four blocks slowly into the design shown on Card 1, without exposing
the card to the subject. Then, leaving the model intact, give four other blocks
to the subject and say Now make one just like this. If the subject successfully
completes the design within the time limit, proceed to Design 2.

If the subject fails to complete the design within the time limit or arranges
the blocks incorrectly, pick up his blocks, leaving the examiner's model intact,
and say Watch me again. Demonstrate a second time using subject's blocks, then
mix them up, still leaving the examiner's model intact, and say Now you try it
and be sure to make it just like mine. Whether subject succeeds or fails on this
trial, proceed to Design 2.

Occasionally a subject will try to duplicate the examiner's model exactly,
including the sides. When this occurs on Design 1, tell the subject that only
the top needs to be duplicated.

The second sample is then administered in a similar manner. The test proper
begins with Design 3. The directions are:

Designs 3-10. Place the card for Design 3 before the subject and provide him
with four blocks. Say Now make one like this. Tell me when you have finished.
When the subject indicates he has finished or at the end of the time limit, mix
up his blocks and present Design 4 with the remark Now make one like this. Go
ahead; let me know when you have finished. Follow this procedure for all
succeeding designs.

[Figure 5 shows two panels, labeled Blocks and Pattern.]

FIG. 5. Block Design test materials. (Pattern copyright 1940, © 1955 by The
Psychological Corporation. Reproduced by permission.)

When Design 7 is reached, take out the five other blocks and say Now make one
like this, using nine blocks. Be sure to tell me when you have finished. For
Design 10 [which has an irregular outline], do not permit the subject to rotate
the card to give the design a flat base. However, give full credit if his
reproduction of the design is rotated not more than 45°.


Time Limits
    Designs 1-2      60 seconds (time each trial separately)
    Designs 3-6      60 seconds
    Designs 7-10    120 seconds


Record time taken for the subject to complete each design if it is done correctly 
within the time limit; bonuses are given for rapid performances on Designs 7-10. 

Discontinue after 3 consecutive failures. Failure on both trials of either
Design 1 or Design 2 is considered one failure.
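
The time limits and the discontinue rule quoted above amount to a small
procedure, which may be easier to follow when written out. The sketch below is
an illustration only, not part of the WAIS manual: the function names are
hypothetical, a trial finished exactly at the limit is assumed to count, and
the speed bonuses on Designs 7-10 are left out because their values are not
given here.

```python
from typing import List, Tuple

# Time limits in seconds, taken from the directions quoted above.
TIME_LIMITS = {design: 60 for design in range(1, 7)}
TIME_LIMITS.update({design: 120 for design in range(7, 11)})


def trial_passed(design: int, correct: bool, seconds: float) -> bool:
    """A trial counts only if the design is correct and within the limit.

    Assumption: finishing exactly at the limit still counts.
    """
    return correct and seconds <= TIME_LIMITS[design]


def design_passed(design: int, trials: List[Tuple[bool, float]]) -> bool:
    """Designs 1 and 2 allow two trials; later designs allow only one."""
    allowed = 2 if design <= 2 else 1
    return any(trial_passed(design, ok, secs) for ok, secs in trials[:allowed])


def designs_given(passed: List[bool]) -> int:
    """Count the designs presented before the discontinue rule stops testing.

    passed[i] tells whether design i + 1 was passed. A sample design failed
    on both trials is a single False, so it counts as one failure.
    """
    consecutive_failures = 0
    for number, ok in enumerate(passed, start=1):
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures == 3:
            return number
    return len(passed)
```

For a subject who passes Designs 1 and 2 and then fails Designs 3, 4, and 5,
designs_given stops at the fifth design, and Design 6 is never presented.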


As in other individual tests, the tester observes the subject's performance
with care. He notes the time required to complete each task, and any errors.
In addition, he watches for any revealing remark, any emotional reaction or
blocking, and any unusual method of attacking the task. Some persons are
cautious and some are impulsive. Some deal with the pattern as a whole and
some must consider each tiny section in turn. Some give up when they face
difficulty, some become erratic and make the same error repeatedly, and
others show increased interest under the greater challenge.


1. If the TMC were administered individually, could profitable observations be
made?

2. Can you think of any questions a subject might ask which the TMC directions
do not cover?

3. Wechsler's directions specify only that the blocks are to be "mixed up" after
each trial. Could this procedure be standardized more exactly? Should it?

4. Wechsler prescribes that each sample should be demonstrated only twice. Even
if the subject is unable to do the task on the second sample, the tester
proceeds to the next design. Is this a wise procedure?

5. The manual is not regarded as sufficient to prepare one to give the test; the
tester learns by observing an experienced tester and discussing procedures
with him. What do you think you could learn about giving the Block Design
test that the manual did not tell you?


PROCEDURE FOR TEST ADMINISTRATION


Conditions of Testing


Certain basic problems of administration are common to all tests. The first
is the physical situation where the test is given. If ventilation and lighting
are poor, subjects will be handicapped. On speed tests particularly, scores
will be lower than they deserve if subjects do not have a convenient place to
write, including sufficient space to spread out materials. Subjects should be
placed so that they can hear directions and see demonstrations clearly. Very
large rooms are generally bad for group testing, unless proctors are present
to watch subjects closely. The large room has the disadvantage that a person
may hesitate to ask a question about unclear directions which he would raise
before a smaller audience. This may be solved by having him raise his hand so
that a proctor will come to his seat and answer his question.
The state of the person tested affects the results. If the test is given when
he is fatigued, when his mind is on other problems, or when he is emotionally
disturbed, results will not be a fair sample of his behavior. Occasionally it is
necessary to test a person at an unfavorable time, as when psychological
examinations must be given to a criminal at the time of his trial. Tests to be
used in the placement and guidance of college freshmen are frequently given
in the midst of a hectic week of orientation, college activities, establishment
of new friends and living arrangements, and adjustment to a semiadult
status. Many a freshman who later proves to be normally intelligent
does badly on placement tests because of homesickness, distraction,
emotional exhaustion, or unidentified causes. While tests given under these
conditions do have predictive value for most of the group, some individual
scores are misleading. If a test must be given at a psychologically
inopportune time, the only correct procedure is to maintain an adequately
critical attitude toward results. Conditions can often be improved by spacing
tests to avoid cumulative fatigue, providing for adequate rest on the night
before tests, and administering the program with a minimum of bustle and
confusion.

During World War II, the General Classification Test was often given to
soldiers just after induction when they lacked sleep, were recovering from
a farewell party, or felt ill from inoculations. In one study men who took a
second form of the test after becoming stabilized in Army routines raised
their scores 11.25 points on the average. This is a large enough shift to raise
a man from the category of potential noncom to that of potential officer
(Duncan, 1947).

Time of day may influence scores, but it is rarely important. Alert subjects 
are more likely to give their best than subjects who are tired and dispirited. 
Equally good results can be produced at any hour, however, if the subjects 
want to do well. In most instances, fatigue apparently affects motivation 
rather than the ability one can summon up. The most thorough examination
of hour-by-hour variation was conducted by Air Force psychologists (Melton,
1947, pp. 49-51). In one study of 2500 cadets being classified at Buckley
Field, Colorado, they found striking and significant differences in
psychomotor test performance (finger dexterity, rudder control, discrimination
reaction time, etc.). In general, performance was at its peak between 10 A.M.
and 3 P.M. In an attempt to confirm and interpret this difference with
further tests of nearly 9000 cadets at other places, negligible differences were
found. The inconsistency has not been explained, but it appears that under 
most operating conditions fluctuations during the day can be avoided. The 
experience at Buckley Field warns the tester never to close his mind to the 
possibility that error may enter his tests from unexpected sources. 


Control of the Group 


Group tests are given only to reasonably mature and coöperative subjects
who expect to do as the tester requests. Group testing, then, is essentially a
problem in command. For efficient testing, subjects must follow instructions
promptly and all must do the same thing. This attitude must be maintained
without interfering with the opportunity of individuals to ask questions. One
person should be in charge, standing in front of the group where he can see
all members. He will find helpful the adage "Never give an order unless you
expect it to be obeyed." False starts, preliminary attempts to call the group
to order while late-comers are finding seats, and ineffectual rapping for
attention make it more difficult to secure real conformity when work begins.
The tester should have full attention when he starts to talk, so that
repetition will not be necessary.

Directions should be given simply, clearly, and singly. A complex instruction,
"Take your booklet, turn it face down, and then write your name on the answer
sheet," will lead to misunderstanding and confusion. It is better to break the
instruction into unmistakable simple units: "Take your booklet." (Hold a sample
up, and watch the group to be sure everyone has taken his booklet before
proceeding.) "Turn it face down." (Demonstrate and wait until everyone has
complied.) "Now take your answer sheet." (Exhibit a sample, and wait for
compliance.) "Write your name on the blank at the top, last name first." The
subjects have a chance to ask questions whenever they are necessary, but the
examiner attempts to anticipate all reasonable questions by full directions.

Military techniques are effective for control of a group. When a military
manner is assumed, however, it may enhance the "inhuman" character of the
test situation and give some people the feeling that the examiner is not
interested in their welfare. Effective control may be combined with good
rapport if the examiner is friendly, avoids an antagonistic, overbearing, or
faultfinding attitude, and is informal when formal control is not called for.
After establishing control, for example, he may often relax his "command
manner" and make informal comments about the test and its purpose; this does
not interfere with his resuming formal control for the test proper.

Emergencies arise which prevent uniform testing of all persons. Occasionally,
for example, a person becomes ill during the test and must leave the
room. Usually it will be possible to collect his materials, indicate that the
test is invalidated, and provide for a make-up on another occasion, perhaps
with a different form of the test. The goal of the tester is to obtain useful
information about people. There is no value in adhering rigidly to a testing
schedule if that schedule will not give true information. Common sense is the
only safe guide in the exceptional situation.


6. An employment office gives all applicants an intelligence test when their ap- 
Plications are filed. One man takes the test, together with several friends, and 
? group leave together. Ten minutes later he returns, Wee ie he 
SUpposed to turn over the last page? ! thought | had finished when T to the 
Ottom of page 9, so | looked back over my answers. l had plenty of time, and 
: page—my friends say the questions 


in thi. i the bottom of page 9 
ere w " be done in this case, if at | 
ere easy." What should “Go on to the next page"? 


7. [...] obtain information for use in guidance, [...] arrived from Latin
America is having great difficulty following directions because of
unfamiliarity with English. The student asks many questions, requests
repetitions, and seems unable to comprehend what is desired. What should the
examiner do?

8. In the course of a clinical analysis of a preschool child who is believed
to be poorly [...], [...] of tests is requested. The psychometrist who [...]
the tests finds the child is negativistic. After coöperating reluctantly on
two tests he becomes inattentive and careless on the third. Assuming that the
test results are needed as soon as possible, what should the tester do?
Directions to the Subject


The most important responsibility of the test administrator is giving direc- 
tions. The purpose of standardized tests is to obtain measurements which 
may be compared with measurements made at other times; it is therefore 
imperative that the tester give the directions exactly as provided in the man- 
ual. If the tester understands the importance of this responsibility, it is sim- 
ple to follow the printed directions, reading them word for word, adding 
nothing and changing nothing. 

The standard directions usually invite the subject to ask questions after 
the directions have been read. In answering such questions, the tester 
must not add to the ideas expressed in the standard directions, since such 
supplementation might give this subject an advantage over those not having 
such aid. The directions are part of the test situation; in some intelligence 
and personality tests the way the subject follows directions is intended to
influence his score.

The most troublesome questions concern matters not discussed in the
standard directions. Examples are: "Should we guess if we are not certain?"
"How much is taken off for a wrong answer?" "Are there any catch questions?"
"If I find a hard question, should I skip it and go on, or should I answer
every question as I go?" The published directions to the test were evidently
not adequate if they ignored these topics. When the tester refuses to give an
answer to the questions about guessing (and he must refuse if the scores are
to be compared with norms), some subjects will guess and some will not.
Therefore, while the directions are superficially standard, the procedure
becomes unstandardized because subjects interpret indefinite instructions,
each in his own way.

Attempts to test skill in flying have shown the crucial importance of
defining the task clearly for the subject. In making a check test on ability
to execute a maneuver, testers found it necessary to tell the pilot exactly
how the performance would be scored. When they omitted this, one pilot kept
his attention on maintaining altitude perfectly, whereas another of equal
ability earned a different score because he concentrated on holding the
plane's heading steady. Tests should be provided with directions which leave
no ambiguities for variable interpretation. When the tester must use a
standard test for which directions are imperfect, he faces a difficulty for
which there is no ideal solution.


9. The Bennett TMC directions are printed in the student's test booklet, and also
read word for word by the examiner. (The only difference is the sentence in
which the examiner asks if there are any questions.) Why is it desirable to read
the directions aloud instead of allowing the student to read to himself?


10. How would you answer the following questions raised by students after hearing 
the TMC directions? 
a. Is this a speed test?
b. If I am not sure of the answer should I mark what I think is best?

11. The California Test of Mental Maturity consists of twelve sections, each con- 
taining a different type of item. The sections are separately timed, each re- 
quiring 3-10 minutes. Is there any reason why a high school seeking data for 
guidance should not give pupils one or two sections of the test each day until 
all of it is taken, rather than giving it in one or two sittings as the manual
suggests?


Judgments Left to the Examiner 


While the directions will be standardized in many respects, it is unwise to
standardize the tester's procedures too rigidly. Precisely the same action or
remark by the examiner can have a different significance for different
subjects, and if so, rigid procedure itself introduces an unstandardized
element into the testing. This may be illustrated first with regard to the
problem of terminating a test.

Directions for most tests place some limit on the time allowed to solve
any problem or to work on any subtest (see the Block Design directions, for
example). To conform to the directions, the tester need only use his stop
watch attentively. Where no time limit is stated, it is still necessary to stop
the painfully conscientious subject who works long after he has done his best.
Sometimes an individual test allows credit only when a task is done in
(say) two minutes, but does not tell the examiner to stop the subject at that
time. The tester must decide whether to let the subject work after he has
passed the credit limit or to interrupt him. This is one of the situations
where the art of testing comes into play; no rules can prescribe how to
terminate an unsuccessful trial.

In an ability test like Block Design, success on one problem has an
encouraging effect during the next, but the effect of failure depends on the
tester. In the tester's eyes, the subject fails when he does not complete the
task within the time limit. If the subject is allowed to continue without
interruption, however, he may finish the task and think that he has succeeded.
This of course tends to help his subsequent performance. On the other hand,
even in extra time the subject may be unable to solve the problem. When he
appears to be getting confused and upset, it may be best to terminate the
problem and make a fresh start. To let him continue might make him
hopelessly [...]. In the absence of definite instructions on procedure,
the tester should observe the subject's attitudes carefully and choose
whatever [...]

Terman and Merrill ([...], pp. 55-58), in their advice to examiners giving
the Stanford-Binet, supply further illustrations of the necessity for
variation in tactics:
The tests of each year group should be given in the order in which
they appear in the manual. . . . In order to secure the child's best
effort, however, it is sometimes necessary to change the test sequence.
For example, if the child shows resistance toward a certain type of test,
such as repeating digits, drawing, etc., it is better to shift temporarily to
a more agreeable task. When the subject is at his ease again, it is
usually possible to return to the troublesome tests with better success. Such
difficulties are particularly likely to be encountered in the testing of
pre-school children. This group presents so many special problems that we
have felt it necessary to give a separate discussion of the techniques of
pre-school testing.

The examiner's first task is to win the confidence of the child and to
overcome any timidity he may feel in the presence of a stranger. Unless
rapport has first been established, the results of the first tests are likely
to be misleading. The time and effort necessary for accomplishing this
are variable factors, depending upon the personality of both the examiner
and the subject. It is impossible to give specific rules for the guidance
of the examiner in establishing rapport. The address which flatters
and pleases one child may excite disgust in another. The examiner must
himself be genuinely interested and friendly or no amount of skilled
technique will enable him to establish a sympathetic, understanding
relationship with children. There are people who lack personal adaptability
to an extent that makes success in this field for them impossible.
Such a person has no place in a psychological clinic.

Nothing contributes more to satisfactory rapport than keeping the
child encouraged. This can be done in many subtle, friendly ways; by an
understanding smile, a spontaneous exclamation of pleasure, an appreciative
comment, or just the air of quiet understanding between equals
that carries assurance and appreciation. Any stereotyped comment following
each test becomes perfunctory and serves no purpose other than
to punctuate the tests. In general it is wise to praise frequently and
generously, but if this is done in too lavish and stilted a fashion it is
likely to defeat its purpose. The examiner should remember that he is giving
approval primarily for effort rather than for success on a particular
response. To praise only the successful responses may influence effort in
the succeeding tests. Praise should never be given between the items of
a particular test, but should be reserved until the end of that test [i.e.,
subtest]. Under no circumstances should the examiner permit himself
to show dissatisfaction with a response, however absurd it may be. With
younger children, especially, praise should not be limited to tests on
which the child has done well. Young children are characteristically
uncritical and are often enormously pleased with very inferior responses.
In praising poor performances of older subjects, the examiner should
remember that the purpose of commendation is to insure confidence and
not to reconcile the subject to an inferior level of response. In the case
of a failure that is embarrassingly evident to the child himself, the
examiner will do well to make some excuse for it. Expressions of
commendation should be varied and should fit naturally into the
conversation.

Although the examiner should always encourage the child to believe
that he can answer correctly if he will only try, he must avoid the
common practice of dragging out responses by too much urging and
cross-questioning. To do so often robs the response of significance and is
likely to interfere with the maintenance of rapport. While the examiner must
be on his guard against mistaking exceptional timidity for inability to
respond, he must also be able to recognize the silence of incapacity or
the genuineness of an "I don't know."

The competent examiner must possess in a high degree judgment,
intelligence, sensitivity to the reactions of others, and penetration, as
well as knowledge of and regard for scientific methods and experience
in the use of psychometric techniques. No degree of mechanical
perfection of the tests themselves can ever take the place of good judgment
and psychological insight of the examiner.


12. The Bennett directions are vague as regards timing: "Little is gained by
allowing more than thirty minutes." The DAT form of the test is definite: "At
the end of 30 minutes, say 'Stop!'" What are the advantages and disadvantages
of the two procedures?


Guessing

At the start of an objective ability test, some subject is likely to ask,
"Should I guess if I am not certain?" Sometimes the test directions include
an answer to this question, but even where such advice is given, some
ambiguity remains. It is against the rules for the tester to give
supplementary advice; he must retreat to such a formula as "Use your own
judgment." The discussion which follows is intended to clarify the guessing
problem for the tester, but should not influence his procedure in giving
tests.

To simplify the discussion, we can speak as if items fall into two
categories: those for which the subject knows the answer, and those for which
he does not know it. If the item calls for a choice of alternatives, the
subject has a chance of picking the correct response even on the items he
does not know. If there are two alternatives, as in true-false items, he will
succeed by chance alone on 50 percent of his guesses. In scoring a two-choice
test, we assume that any wrong choice represents an unlucky guess, and that
the number of lucky guesses is equal to the number of wrong guesses. The
final score on a true-false test is counted as "number of items right minus
number marked wrong," i.e., total number of items marked correctly less the
number thought to have been marked correctly by guessing. If there are $n$
choices per item, the chance probability of a correct guess is $\frac{1}{n}$
and that of an unlucky guess is $\frac{n-1}{n}$. For every $n - 1$ incorrect
guesses, we expect 1 correct guess. Hence the scoring formula most often used
is "Right minus $\frac{\text{Wrong}}{n-1}$." Other scoring formulas have been
developed, some of which are probably superior to this one, but none of them
is much used. In a test with a liberal time allowance and comparatively easy
items, subjects usually mark every item. When that happens, the rank order of
the scores remains the same whether the score used is "number right" or
$R - \frac{W}{n-1}$.
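
The scoring rule just described can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
corrected_score and the sample numbers are invented for the example and do
not come from any test manual. Exact fractions are used so that no rounding
enters the arithmetic.

```python
from fractions import Fraction


def corrected_score(right, wrong, n_choices):
    """Return Right minus Wrong/(n - 1); omitted items do not enter.

    `corrected_score` is a hypothetical helper name used for illustration.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two alternatives")
    return Fraction(right) - Fraction(wrong, n_choices - 1)


# Illustrative (made-up) papers: 30 right and 10 wrong.
print(corrected_score(30, 10, 2))  # true-false: 30 - 10
print(corrected_score(30, 10, 5))  # five-choice: 30 - 10/4

# When every item is marked, wrong = total - right, so the corrected
# score rises steadily with number right and rank order is unchanged.
n_items, n_choices = 40, 5
rights = [12, 30, 25]
corrected = [corrected_score(r, n_items - r, n_choices) for r in rights]
order_by_right = sorted(range(len(rights)), key=lambda i: rights[i])
order_by_corrected = sorted(range(len(rights)), key=lambda i: corrected[i])
print(order_by_right == order_by_corrected)
```

With two alternatives the divisor is 1 and the rule reduces to "right minus
wrong"; with more alternatives each wrong answer costs only a fraction of a
point.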

A correction formula is desired because some people guess more freely
than others. The guessers would mark many right answers by chance alone.
The scoring formulas attempt to wipe out gains due to guessing.
Unfortunately, the basic logic described above does not describe the situation
fairly, and the formula does not truly "correct for guessing." The basic
assumption is incorrect. You cannot divide items into those the subject knows
perfectly and those he does not know at all. There are items he knows fairly
well but is not positive of, and other items where he has hazy knowledge.
"Guessing" is not a matter of pure chance. Even on the items he knows least
about, the guesser's experience and common sense should permit him to choose
correctly more often than he would if he selected answers by rolling dice. A
person who guesses intelligently on ten five-choice items can expect to get
perhaps four items right, instead of the two items expected from chance
guessing. By formula, four right answers would give him a score of 2½ points.
Since a person who does not guess receives a score of zero on the same 10
items, the score is raised by willingness to gamble.
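
The arithmetic of the ten-item example can be checked with a short Python
sketch. It assumes, purely for illustration, that each guess succeeds
independently with a fixed probability; the function name
expected_corrected_score is hypothetical.

```python
from fractions import Fraction


def expected_corrected_score(n_items, p_correct, n_choices):
    """Expected value of Right - Wrong/(n - 1) when all n_items are guessed.

    p_correct is the assumed chance that any one guess is right
    (a simplifying assumption; real guessing varies from item to item).
    """
    p = Fraction(p_correct)
    expected_right = n_items * p
    expected_wrong = n_items * (1 - p)
    return expected_right - expected_wrong / (n_choices - 1)


# Blind guessing on ten five-choice items: 2 right, 8 wrong expected.
blind = expected_corrected_score(10, Fraction(1, 5), 5)

# Intelligent guessing as in the text: 4 right, 6 wrong expected.
informed = expected_corrected_score(10, Fraction(2, 5), 5)

print(blind, informed)
```

Under these assumptions the formula cancels blind guessing exactly, while
any guess that beats chance leaves a positive expected gain over omitting
the items.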

The subject decides what risk he is willing to run. Some people mark only
the items they are very sure of. Others mark any item they think they
understand, and still others mark absolutely every item. This difference in
tendency to gamble is not eliminated by any change of directions or penalties.
As the penalty becomes more severe, guessing diminishes, but the bold still
take more chances than the timid (Swineford, 1941; Torrance and Ziller,
1957).

Even if the standard "correction for chance" is used, the person who
gambles on every doubtful item is likely to gain more than he loses. The only
exception is where the test constructor is so skillful in writing misleading
alternatives that the guesser is likely to pick them in preference to the
right answer. Therefore the person taking a test is usually wise to guess
freely. (But remember that the tester is not to give his group an advantage by
telling them this trade secret!)

From the point of view of the tester, tendency to guess is an
unstandardized aspect of the situation which interferes with accurate
measurement. Most European group tests remove the opportunity for blind
guessing by presenting items in "open end" form, where the subject must write
or draw the answer. American group tests, however, are almost always in
multiple-choice form because these items require less time and can be scored
mechanically.
The systematic advantage of the guesser is eliminated if the test manual
directs everyone to guess, but guessing introduces large chance errors.
Statistical comparison of "do not guess" instructions and "do guess"
instructions shows that with "do not guess" instructions the tests have
slightly greater predictive power (Greene, 1952, pp. 73-75; see also
Lindquist, 1951, pp. 947 ff.). Chance errors multiply when everyone guesses,
and their cumulative influence on accuracy of measurement outweighs the
advantage of "do not guess" instructions. The most widely accepted practice
now is to educate students that wild guessing is to their disadvantage, but to
encourage them to respond when they can make an informed judgment as to the
most reasonable answer even if they are uncertain. The following advice given
to College Board applicants is much fairer than strict instructions not to
guess:


When the test is scored, a percentage of the wrong answers is
subtracted from the number of right answers as a correction for haphazard
guessing. It is improbable, therefore, that mere guessing will improve
your score significantly; it may even lower your score. If, however, you
are not sure of the correct answer but have some knowledge of the
question and are able to eliminate one or more of the answer choices as
wrong, your chance of getting the right answer is improved, and it will
be to your advantage to answer such a question.


13. Bennett Form AA items (with very few exceptions) have two alternatives. The
test is scored R - 4W, for no stated reason. What effect will this formula have
as com[...]? Whom does it favor?

14. In the [...] TMC, items have five alternatives. What is the
corresponding scoring formula?


15. When scores are "corrected for guessing," some person may receive a
negative score. What does this mean? Is he less able than a person scoring
zero?


16. Compute scores for each of the following persons by the usual correction
formula:
Test 1, true-false.   A has 20 right, 6 wrong, 7 omitted.
                      B has 22 right, 8 wrong, 3 omitted.
Test 2, three-choice. C has 15 right, 6 wrong, 4 omitted.
                      D has 18 right, 3 wrong, 4 omitted.
Test 3, five-choice.  E has 20 right, 6 wrong, 9 omitted.
                      F has 6 right, 6 wrong, 23 omitted.


17. Give a difficult five-choice test (untimed) to a friend with instructions
to answer items only when fairly certain of the correct answer. When he has
finished the test, provide a pencil of another color and direct him to answer
all the remaining items, making the best guess he can. Determine his raw score
on each trial with and without correction for chance. If the test manual
included "do not guess" directions, how much would he gain or lose by guessing
despite the directions?

18. If you were taking a five-choice test of professional knowledge in psychology
as a requirement for a diploma, would you mark items of which you were
uncertain? Assume that the test is scored R - 1W.

19. In a time-limit test of mental ability using multiple-choice items, how
rapidly should the subject work, in view of the fact that higher speed leads to
more errors?

20. Some instructors advocate scoring achievement tests by formulas which
penalize guessing very heavily, such as "Number right minus twice number
wrong." What effect would this have on validity of measurement? (Cronbach,
[...]; [...], 1940).


21. Should test directions tell what scoring formula will be used? 


MOTIVATION FOR TAKING A TEST 


In making a physical measurement (for instance, weighing a truckload of
wheat) there is no problem of motivation. Even in weighing a man, when we put
him on the scale we get a rather good measure no matter how he feels about
the operation. But in a psychological test the subject must place himself on
the scale, and unless he cares about the result he cannot be measured.


Incentives That Raise Scores 


In an ability test, our problem is like that of the industrial manager who
wants a high rate of production. Effort and productivity depend on the
reward the person foresees. The most direct reward for good test performance
is being hired for a job or being given a desirable assignment. Equally
powerful and more universally available as a source of motivation is "ego
involvement," that is, the desire to maintain self-respect and the respect of
others. Effort is stimulated also by sheer interest in the task and by the
habit of conforming to authority.

The test score is not readily altered by simple incentives. There have
been many attempts to raise test performance by prizes, pep talks, and
monetary payments for increases in score. Almost invariably, such attempts
fail to produce appreciable improvement on ability tests over the scores
earned under the regular conditions of administration. (See, for example,
Benton, 1936; Ferguson, 1937.) These incentive studies generally offered
rewards for compliance with the demand of an authority who wanted the
test scores. When the motivational pattern is shifted to arouse the subject's
own concern over his test score, we find that scores can be improved.
Flanagan (1955), in a study of a large number of aviation cadets and
high-school students, compared evidence of careless and unmotivated
performance under various conditions. His evidence was obtained by counting,
first, the number of cadets who used stereotyped patterns of marking the
answer sheet, such as the sequence A B A B A B, and second, the number earning
chance scores on easy items. Even though the tests affected the cadets' duty
assignment, completely meaningless responses were found. On [...] typical of
the entire series of tests, two cadets per thousand showed stereotyped
patterns, and five obtained chance scores. High-school students were given a
similar test with no particular incentive, merely being told that research
data were being collected. Here there were four stereotyped-response papers
per thousand, and 21 chance scores. But in another school where the students
expected that they would receive a full report on the tests together with
counseling, there were no stereotyped responses, and only three chance
patterns per thousand. It is evident that research employing tests has little
meaning unless the subject is given a personal reason for taking the test. If
he is merely asked to coöperate in an experiment, his responses may be casual
or even careless.

Motivation to do a task well or to make a good impression on an adult is
learned. During his early years, the child develops attitudes toward himself
and toward task performance which have a profound influence on his
response to tests and to school assignments. The typical middle-class child
learns to work hard because he obtains praise, tangible rewards, and special
opportunities when he achieves well. The lower-class child very often learns
to take assignments less seriously and to work barely enough to keep out of
trouble. His self-respect depends most on his relations with his classmates
outside of school, and relatively little on obtaining approval from adult
authorities (Eells et al., 1951).


Motives That Reduce Scores


The subject may frankly wish to do poorly on a test (see Pollaczek, 1952).
There are times when pupils try to limit their scores on mental tests because
a school has a classification plan in which, it is rumored, the better students
will be required to do extra work. If, in military classification, men suspect
that passing certain tests qualifies one for an unpopular assignment, there is
a temptation to fail deliberately. Another instance is the boy who
deliberately failed his school subjects so that instead of being promoted he
would be kept in the grade where his less intelligent friends were to remain.

When the subject wishes to earn the best score he can, his very desire to
do well may interfere with good performance. When one is tense, he
commits errors that he would readily detect as such otherwise. In psychomotor
tests, tension leads to poor coördination and erratic movements. In a verbal
test, the subject who fears criticism of his answers may attempt to escape it
by being overcritical of himself. In clinical mental testing, anxious patients
frequently find fault with their own answers or elaborate them to include
all possible variations and qualifications. In doing this they may spoil an
answer that would have received credit. Anxiety over tests is generated at an
early age by the attitudes of teachers, parents, and other children. Sarason
and his associates developed a special questionnaire to measure "test
anxiety," using such items as "When the teacher says she is going to give the
class a test, do you get a nervous (or funny) feeling?" (Sarason et al., 1958).
Substantial individual differences were found which are fairly stable, as
shown by retests at a later date. The median elementary-school child admits
to 12 anxiety symptoms on the list of 43 covered by the test. Anxiety
increases gradually through the school years. It is especially interesting to
note that the anxiety scores have only very slight negative correlations with
ability. Evidently, test anxiety is about as common among the very able pupils
as among the dull ones.

The detrimental effects of anxiety may be increased by the very tactics
the tester uses to elicit the subject's best efforts. Sarason and his associates
(1952) used his questionnaire to identify Yale freshmen with high and low
anxiety (HA and LA groups, respectively). These groups were divided. Half the
students received "ego-involving" (EI) instructions which stressed that these
were intelligence tests and would be used to assist in interpreting
freshman entrance tests. The NEI ("not ego-involved") group, on the
contrary, was told that the examiner was standardizing some tasks and that no
one would examine individual standings. The test was a stylus maze;
five trials were given. Error scores are shown in Figure 6. We find that the
NEI groups had intermediate scores, there being little difference between
the HA-NEI and LA-NEI subgroups. In the low-anxiety group, EI
instructions had a small, generally helpful effect. The subjects with high
anxiety about tests, however, did much worse when threatened by the importance
of doing well than they did under emotionally neutral conditions.

That anxious and defensive reactions interfere with test efficiency is
also shown in a study of student nurses (G. Wiener, 1957). Student nurses at
the top and bottom extremes on "distrustfulness" were selected by a
questionnaire. Each student took sections of the Wechsler intelligence scale.
The Picture Completion test asks the subject what is missing in a picture (e.g.,
one eyebrow in a sketch of a face). Distrustful subjects were inclined to
deny that anything was missing when the answer did not come to them
immediately. Likewise, on a test of similarities ("How are praise and
punishment alike?") the distrustful students were more inclined to deny that
the words were alike. The difference, though significant, was small;
distrustful students averaged 2.7 suspicious comments on the two tests compared
to 0.9 for the trustful students. To measure the effect of suspiciousness on
scores, Wiener compared scores on the PC and Similarities tests with a
vocabulary score that presumably is not affected by suspiciousness. This
comparison shows that extreme suspiciousness lowers the IQ by about three or
four points. As Wiener says, "People who say, 'There is nothing missing in that
picture!' are responding to internal needs rather than to the testing
situation."

[Figure 6: two panels, High Anxiety Group and Low Anxiety Group, each
plotting error score against trial (1 to 5).]
FIG. 6. Maze performance under ego-involving (EI) and neutral (NEI)
instructions (Sarason et al., 1952).

[...] necessarily lowers scores. Fears are ever present in testing: a
delinquent fears that his punishment will depend on the test results; a child
fears that a poor intelligence rating will disappoint his parents and diminish
their affection; a college girl fears that failure will force her to leave her
campus friends and return to the farm; an anxious patient fears that a test
will prove him "insane." Fears such as these can be listed without end. A
striking example is the case of the young re[...] time of war, who failed his
physical [...]
tional concomitant than is poor thinking, but the disrupting physiological
responses have their mental counterparts. 

Insofar as the tester can convince the subject that the tests will be used to
help him, not to harm him, the validity of scores will be increased.
Emphasis must be placed on the positive use of results. A job applicant fearful
of failing an aptitude test can be given to understand that test scores may
indicate a field where he will succeed. A patient fearing the verdict of a
diagnostic test should understand that it will point the way to a cure.


22. Mandler and Sarason (1952) comment, "It is questionable whether intelligence
test scores adequately describe the underlying abilities of individuals with a
high anxiety drive in the testing situation." On the other hand, it can be argued
that a person who is not motivated to avoid failure will perform below his best
level. Which argument seems correct? How could you test whether anxiety
lowers or raises ability scores?

23. Hebb and Williams (1946) devised a test to measure the intelligence of rats.
The test consisted of a set of mazes to be run, success being scored if a [...]
path to the foodbox was taken. What problems of motivation would need to be
considered in administering this test?

24. In an "agility" test used by the British Armed Forces at one time, each man
was tested separately while his squad of perhaps twenty others watched. The task
called for running back and forth along a cross-shaped pattern, transferring
rings from one post to another.
a. What effect on score would be expected from being tested in a group
rather than without an audience?
b. What effect would be expected as a result of announcing each man's score
at the end of his trial, to be applauded if good?
c. What advantage or disadvantage would a man have who came last in the
group?

25. If, on a personality test, a person reveals something discreditable about
himself, can one suggest any reason other than a strong desire to be honest?


Preparing the Subject for the Test 


The motivation most helpful to valid testing is a desire on the part of the
subject that the score be valid. This is not the normal competitive set where
one desires a high score whether it is true for him or not. It is a scientific
set, a desire to find out the truth even if the truth is unpalatable. Ideally,
the subject actually becomes a partner in testing himself.

Too often an autocratic approach is followed, something like "Take this
test and I shall decide what is to be done with you." Most testers would
disclaim any intention of dictating, yet it is true that tests have most often
been used for the private information of the tester, who then bases
recommendations on them.

Coöperation between tester and subject is not an impossible goal.
Psychotherapy is based on diagnostic testing; decisions of school
administrators depend on standard tests; the employment manager must take
responsibility for hiring the best-qualified applicants. Responsibility cannot
ordinarily be transferred to the person tested, but the subject can be made a
member of the tester's team. The tester can take him into confidence as to the
purpose of the testing and portray the test as an opportunity to find out
about himself, just as the physician often tells the patient what medicine is
being given and what good results are to be expected from it. If the subject
knows what a test is measuring and why a fair measurement is to his advantage,
he will have little motive to provide an untruthful picture. Perhaps the most
"autocratic" of the current uses of testing is in industrial hiring,
necessarily so, since the goal of testing is profit to the firm. Yet the tests
given in the hiring line are to the advantage of the person tested, and it will
build good will if he knows it. The very facts regarding turnover that lead the
employer to screen applicants are facts which would reassure the worker if he
knew them. If he does well on the test, he can have confidence that he will
make good on the job. If he does badly, he is unlikely to last on the job. The
failure on the tests saves him from wasting time in a dead end; he can begin
instead to accumulate experience and seniority in another job for which he
is fitted.

The desirability of preparing the subject for the test by appropriate
advance information is increasingly recognized. It was formerly the common
practice in counseling centers to administer a test battery routinely to every
person coming in, and to use the test results as a basis for the first
counseling interview. Now counseling more often commences with one or two
interviews which help the person define his problem. The interview gives him a
more realistic understanding of what tests can do, reduces anxiety about
the test results, and helps in the choice of tests. Another type of
indoctrination is found in some of the great nation-wide testing programs like
that of the College Entrance Examination Board. Booklets have been prepared
for both the Scholastic Aptitude Test and the subject proficiency tests. The
booklet describes the test, gives advice on efficient work procedures, and
provides specimen items. This information increases the applicant's
confidence and reduces the disadvantage which an applicant inexperienced in
taking standard tests might otherwise have.


26. How could a "coöperative" point of view in testing be adopted:
a. By a school principal who wishes to divide his eighth grade into sections
on the basis of intelligence?
b. By a counselor who must approve the plan of a handicapped veteran to go
to college and prepare for dentistry?
c. By a consulting psychologist who is asked by a social agency to diagnose
a potential delinquent?

27. What information should the tester give the subject in each of the
following cases?
a. College freshmen are to be tested to determine which ones may fail
because of reading deficiency.
b. At the end of a course in industrial relations for foremen, an examination
on judgment in grievance cases is to be given.

28. Is it ethical, in a test of emotional adjustment, to phrase directions so
that the subject believes his imaginative ability is being tested?


Coaching and Test Sophistication. Preparation may be carried to extreme 
lengths. In Great Britain, a test given near the age of 11 determines what 
type of secondary education a child will receive. This is a fateful decision, 
opening or closing the gate to most professions and to financial and social 
status. Parents, concerned to help their children, often pay private tutors 
to prepare the child for the examination by special after-school lessons. In- 
deed, it has been said that in some districts two-thirds of the candidates 
receive such "black-market" coaching. The school system, unwilling that 
these children should have an advantage, then may introduce a special 
"coaching class" during the term preceding the examinations. Coaching for 
the arithmetic and language tests consists chiefly of additional drill. The 
third portion of the examination is a test of general mental ability. Coaching 
may include study of tests used in past years, practice on reasoning problems 
used in typical mental tests, and instruction on how to solve test problems 
rapidly. 

Preparation of this sort guarantees that the coached pupils perform at their
best, but perhaps spoils the test by giving them an improper advantage over
uncoached pupils. To evaluate any such procedure, it is necessary to consider
the distinction between intrinsic and extrinsic aspects of the test
performance (Gulliksen, 1950). The test is used to decide which pupils will
profit most from a later educational program. Any ability which aids
performance on the test and in the later instruction also may be called
intrinsic, whereas an ability useful only in the test is extrinsic to the
decision being made. Coaching which improves the performance intrinsically is
fair, and does not spoil the test. Teaching extra arithmetic gives the pupil
an advantage on the test, but this extra training presumably will also make
him a better student. Teaching him how to solve mazes, however, is beneficial
only in tests presenting maze items; it cannot help his later schoolwork.

There have been many studies to measure the effects of coaching, and the
studies differ in procedure and results. Some very large gains in score were
found in studies where subjects were initially almost completely unfamiliar
with objective speeded tests. Recent British results (Alfred Yates et al.,
1953, 1954) show what can be expected among reasonably well-educated
pupils today. Gains are measured by repeating the same test after the
experimental interval. According to these studies,


"Control" groups gain (on the average) about 2-3 points in IQ, merely
as a result of taking the first test.

"Coached" groups gain about 5-6 points, after having been told about
tests and having had numerous representative items explained by the
teacher.

"Practiced, uncoached" groups gain about 6 points, after taking from
four to eight tests without special explanation.

"Practiced and coached" groups may gain 8-10 points.


It is noted in all these studies that a very extended period of practice or
coaching is no more helpful than a few sessions. Gains such as those shown
in these studies, and in the corresponding American studies (Dear, 1958),
are relatively small. While they might make the difference between success
and failure in obtaining admission to a higher school, for a pupil near the
borderline, coaching will not raise a poor college prospect sufficiently to
help him over the examination hurdle.
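The gains listed above can be put on a common footing by subtracting the control groups' retest gain from each of the others. The sketch below does only that arithmetic. Using the midpoint of each quoted range is an assumption made for illustration, not a figure from the studies, and the names `GAINS`, `midpoint`, and `net_gain_over_control` are hypothetical.

```python
# Approximate IQ-point gains quoted in the text, as (low, high) ranges.
GAINS = {
    "control": (2, 3),
    "coached": (5, 6),
    "practiced, uncoached": (6, 6),
    "practiced and coached": (8, 10),
}


def midpoint(gain_range):
    """Midpoint of a (low, high) range; an illustrative simplification."""
    low, high = gain_range
    return (low + high) / 2


def net_gain_over_control(gains=GAINS):
    """Gain of each group beyond what retesting alone produces."""
    baseline = midpoint(gains["control"])
    return {
        group: midpoint(rng) - baseline
        for group, rng in gains.items()
        if group != "control"
    }


if __name__ == "__main__":
    for group, net in net_gain_over_control().items():
        print(f"{group}: about {net:.1f} points beyond retest alone")
```

On these assumed midpoints, even the combined treatment adds only a handful of points beyond the retest effect, which is the sense in which the text calls the gains relatively small.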


29. What implications does the British investigation of coaching have for Ameri- 
cans who use mental tests to select scholarship winners? 

30. In Japan a young man's career opportunities depend very much on his
ability to capture one of the limited number of openings in the University.
Vacancies are filled on the basis of entrance examinations and school
records. Magazines bearing such titles as Student Days, Examiners' Circle,
and Period of Diligent Study have large circulations. These magazines deal
with topics of interest to candidates including information about typical
test materials (though the actual test questions are of course guarded).
Would such magazines increase or decrease the validity of the tests?

31. In planning a competitive mental test to be given all Japanese youth
applying for higher schools, two policies appeared possible. One was to
devise new types of test items each year, so that knowledge about previous
examinations would be of no help. The other proposal favored using the same
types of questions every year (for example, number series) but changing the
items used. Compare the plans from the point of view of the test maker, the
student, and the person interpreting the results.

32. Which of these types of preparation for a scholastic aptitude test leads
to a gain in intrinsic ability?
a. Vocabulary-building exercises.
b. Advice about whether to guess when in doubt.
c. Therapeutic counseling to reduce fear of failure and feelings of
inadequacy.

33. In some college residence halls, students file questions from past
examinations. From the point of view of the professor teaching the course
year after year, does this increase or decrease the validity of measurement?


Testing Procedure as Standardization of Behavior 


We may understand better the problem of framing directions and arousing
proper motivation if we realize that the psychometric tester tries to
standardize the behavior of the subject, as well as the test stimuli. Even
though he is measuring individual differences, his procedures are designed to
eliminate individual differences, that is, to eliminate variation in every
characteristic save the one that his test is supposed to measure.

To clarify this, consider the physiological measure of basal metabolism
rate. If a doctor wants a BMR measure, he requires his patient to fast for
eight hours before the test. This eliminates differences in eating habits
which would affect oxygen utilization. For the test itself, it is necessary to
reduce the patient's bodily activity to an absolute minimum by putting him
into bed; every patient is, in effect, reduced to a standard activity level. The
BMR, calculated from the oxygen intake and the carbon dioxide exhaled, is
a useful measure of the patient's physiological state. This measure is taken
in an artificial "standard condition" which almost never occurs in real life.
The person's metabolism rate as he goes about his daily affairs is not much
like his BMR, since it is affected by his eating, activity, and other variables.

Psychological tests are similarly designed to extract one variable, purified
as much as possible, from the total life activity. The psychologist is concerned
if some students fail to understand his directions because this irrelevant
difference will affect his results. He is concerned if some students receive
coaching, if some are especially anxious about the test, if some interpret the
test as a speed test while others think carefulness counts most. All these
sources of variation blur his measurement. He tries, in setting the stage for a
test, to reduce all his subjects to a "standard state" of motivation,
expectation, and interpretation of the task.

An example of such standardization is found in certain tests intended to
measure personality traits. One might evaluate these qualities by observing
behavior in everyday affairs. The meaning of this behavior is uncertain,
however, since different subjects may be trying to do quite different things.
If the situation is more definitely structured so that all subjects have the
same goal in mind, differences are more certainly attributable to personality. For
this reason, many tests of persistence, reaction to frustration, flexibility,
and the like are presented in disguised form. The subject is motivated just as
for an ability test. He does not realize that the tester will pay attention
chiefly to how he goes about the task.


TESTING AS A SOCIAL RELATIONSHIP 


The tester has been accustomed to think of himself as an unemotional,
impartial task-setter. His traditions encourage the idea that he, like the
physical scientist or engineer, is "measuring an object" with a technical
tool. But the "object" before him is a person, and testing involves a complex
psychological relationship. The traditional concern with motivation and
rapport recognizes this fact but, as illustrated in the foregoing sections,
leads to little more than a recommendation that the tester be pleasant and
encouraging and help the subject understand the value of the test. This, we
are beginning to suspect, barely touches the real social-psychological
complexities of testing.


As Schafer (1954, p. 6) says, 


The clinical testing situation has a complex psychological structure. It
is not an impersonal getting-together of two people in order that one,
with the help of a little "rapport," may obtain some "objective" test
responses from the other. The psychiatric patient is in some acute or
chronic life crisis. He cannot but bring many hopes, fears, assumptions,
demands and expectations into the test situation. He cannot but respond
intensely to certain real as well as fantasied attributes of that
situation. Being human and having to make a living (facts often ignored),
the tester too brings hopes, fears, assumptions, demands and expectations
into the test situation. He too responds personally and often intensely
to what goes on, in reality and in fantasy, in that situation, however
well he may conceal his personal response from the patient, from himself,
and from his colleagues.

The subject coming for an individual test almost invariably is in
difficulties. He may have been referred by some authority who demands that he
be tested; if so, the tester may be simply another authority to rebel against.
Other subjects are self-referred. One might expect coöperation in such a
case because the subject is asking for help, but he too may come with
motives which conflict with the tester's objectives. The very fact that he has
had to seek psychological help may disturb the person who wants to be
independent. He may have doubts regarding his own adequacy which he
attempts to suppress by every available strategy. It is commonplace to
discover, behind a college student's self-referral for remedial reading or
vocational counseling, a problem of sexual adjustment or emotional conflict
with parents. The student, by focusing his attention and that of the
psychologist on a superficial or nonexistent problem, is using an unconscious
sleight-of-hand to conceal the problems he does not want to face.

Instead of being hostile and resistant, the subject may present himself as
friendly and totally submissive. This can go far beyond the normal, mature
fact-giving which the tester wishes for. Some subjects "turn themselves over"
to the psychologist, thereby avoiding responsibility for their own problems.

None of us is willing to expose himself completely, or even to learn the
whole truth about himself, yet the job of the tester is to penetrate personal
secrets. In clinical testing and interviewing particularly, the psychologist
really tries to bring to the surface the whole personality: sexual attitudes,
feelings of inadequacy, hostilities and wishes the patient is ashamed of, and
so on. Even when the tester has a much more limited aim, the patient may
believe that his intimate desires and anxieties will be exposed by the tests.
The popular literature on psychology and psychiatry being what it is, the
subject may expect the psychologist to be almost pruriently concerned with
tabooed areas. Or he may view the tester as a modern magician from whom
no truth can be hidden and whose every judgment is beyond question.

These attitudes define a role which the tester is expected to play, and
the tester's own self-conceptions define another. When tester and subject
meet, therefore, their mutual demands may support each other, or they may
pull in opposite directions. A client who wants to escape responsibility may
fall into the hands of a tester who likes to pose as infallible and to dominate
others. This tester is unlikely to sense that the client's seeming passivity is
just a strategy adopted to keep the tester from probing into an unstated
problem. The situation is little better if the tester is one who, because of
self-doubt, cannot comfortably take responsibility. Pressed by the client to
make a definite recommendation, this insecure tester will retreat from
responsibility. He will pile test upon test, so that the mass of data will
relieve him of the burden of judgment. He will qualify his interpretations and
obscure them in technical jargon to frustrate the client's unacceptable
demand. Finally, he terminates the counseling with "All tests can do is give
you a basis for making your own decision." By this he avoids a counseling
relation (longer, more intimate, but uncomfortable) in which he could bring
the client to understand his passivity and hesitancy.

Schafer points out that the tester chooses his profession because it satisfies
his needs. The tester may be one who feels inadequate in social relations
but who can obtain reassurance from seemingly objective instruments. He
may prefer the brief and distant contact of objective testing to the
demanding personal relations that teachers and therapists have. He may be
answering doubts about himself by comparing himself favorably at every turn
with those he tests. On the contrary, instead of having these remote and
competitive attitudes, he may be one who seeks grateful and dependent
reactions from subjects.

All these patterns can distort testing procedures and test interpretation.
The overly "objective" tester may be unwilling to give the subject the
emotional support required to reduce resistance and elicit his best
performance. He may overemphasize difficulties that can be treated
unemotionally (limited vocabulary, for example) but overlook emotional needs.
The competitive tester may be too ready to identify weaknesses, or to describe
subjects he admires as having virtues he hopefully sees in himself. (Wilson
reports that, when he trained a group of intelligent convicts to give a
performance test to new inmates, he had to supervise constantly to prevent
their making procedural errors to reduce the subject's score and so magnify
their own superiority.) The tester who seeks emotional support from patients
may be too lenient and encouraging, and all too willing to overlook weaknesses
in the record.

Granting that both tester and subject come to the situation with a full
complement of human motives, some of which they are not aware of, what
should the tester do about it? At this point, with research on these motives
almost entirely lacking, we can make only common-sense suggestions. The
first of these is "Know thyself." The more the tester knows of his own
personality, of his preferences for different types of subject, and of the
biases he brings to test interpretation, the greater the chance that he can
meet each situation properly. The second suggestion is Schafer's
recommendation that the social situation itself be considered an important way
of understanding the subject, and that his strategies, demands, and
resistances themselves be taken into account in interpreting scores. His view,
which is the only thorough statement on the problem yet attempted, is well
summarized in this paragraph (1954, pp. 72-73):

There are those who would object that this total-situation approach
violates the objectivity of test interpretation. Only in the narrow and
false sense in which objectivity has been usually conceived is this true.
The ideal of objectivity requires that we recognize as much as possible
what is going on in the situation we are studying. It requires in
particular that we remember the tester and his patient are both human and
alive and therefore inevitably interacting in the test situation. True, the
further we move away from mechanized interpretation or comparison
of formal scores and averages, the more subjective variables we may
introduce into the interpretive process. The personality and personal
limitations of the tester may be brought into the thick of the interpretive
problem. But while we thereby increase the likelihood of personalized
interpretation and variation among testers, we are at the same time in a
position to enrich our understanding and our test reports significantly.
The more data we use, after all, the greater the richness and specificity
of our analyses, and in the long run the more accurate we become.


Schafer's view obviously demands impressionistic interpretation, and is
not fully acceptable to psychometric testers. His view need not be accepted,
since no evidence is offered that these complex interpretations can indeed
be made accurately. Those who reject Schafer's recommendation must,
however, face the problem of interpersonal dynamics and find their own
solution. Even a strictly poker-faced administration of an individual mental
test is an hour-long stress situation, every moment of which involves
emotional interaction between tester and subject.

From one viewpoint, it has been suggested that the effect of tester
personality is merely that different testers obtain different average
results. IQs obtained by one tester average a few points higher than those
of another. Rorschachs given by tester X contain more "movement" responses
than those of tester Y. Why not, then, "calibrate" each tester as a laboratory
thermometer is calibrated, so that his errors can be compensated for? If we
know that, on the average, his Wechsler IQs average 2.3 points higher than
those of other testers, we can adjust his reports. This is not a realistic
suggestion. Calibration requires an overwhelming amount of research, and at
best such a correction deals with the average error rather than the errors
which vary from case to case. Individual tests cannot be standardized well
enough so that all testers will obtain identical results; the best hope is that
careful training of testers can remove most of their consistent errors.
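The limitation of "calibration" described above can be shown with a small numerical sketch. The five error values below are invented for illustration (chosen so that they average 2.3 points, the figure used in the text), and the function name `calibrate` is hypothetical. Subtracting the tester's average error brings the mean of what remains to zero but leaves the case-to-case errors as large as before.

```python
from statistics import fmean


def calibrate(errors):
    """Return (average_error, residuals) for one tester.

    `errors` are obtained-minus-true scores for individual cases.
    The residuals are what remains after the average error is removed.
    """
    offset = fmean(errors)
    residuals = [e - offset for e in errors]
    return offset, residuals


# Hypothetical errors for five subjects tested by the same examiner.
errors = [5.3, -0.7, 2.3, 4.3, 0.3]

offset, residuals = calibrate(errors)
mean_abs_residual = fmean(abs(r) for r in residuals)

if __name__ == "__main__":
    print(f"average error removed by calibration: {offset:.1f}")
    print(f"mean residual after calibration: {fmean(residuals):.1f}")
    print(f"mean absolute residual per case: {mean_abs_residual:.1f}")
```

In this made-up case the correction removes the 2.3-point average, yet individual reports are still off by about two points on the average, in either direction.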


34. In what way could sympathy and love for children bias a tester? What parts
of the testing process would be affected by this bias?

35. If social factors and examiner differences affect individual tests more than
group tests, does this imply that group tests are better measuring instruments?

36. Does a formal and impersonal attitude toward all subjects standardize the
testing relationship?


Suggested Readings 


Biber, Barbara, & others. Stenographic record of psychological examination.
Life and ways of the seven-year-old. New York: Basic Books, 1952. Pp. 631-
This is a record of remarks made by both examiner and subject before and
during a series of performance tests of mental ability, including the Porteus
maze and several formboards. Note the many places where the examiner is
willing to digress from the test into other conversation in order to maintain
rapport.
Bingham, Walter V. Administration of tests, and Giving group tests. Aptitudes
and aptitude testing. New York: Harper, 1937. Pp. 224-244.
These common-sense suggestions, based on long experience, will be of
great value to beginning testers. A translation of a German checklist for
observing behavior during the test, used for an impressionistic evaluation of
performance, is included.
Schafer, Roy. Interpersonal dynamics in the test situation. Psychoanalytic
interpretation in Rorschach testing. New York: Grune & Stratton, 1954.
Pp. 6-7 and [...]
In a thought-provoking discussion of the motives with which the tester and
subject approach each other, Schafer speculates regarding defenses in the
tester's personality (dependence, overintellectualization, sadism, etc.) which
may reduce his effectiveness.
Thompson, Anton. Test-giver's self-inventory. Calif. J. educ. Res., 1956, 7, 67-
A checklist pointing out nearly fifty specific practices that characterize good
test administration includes numerous techniques that practical experience
shows to be advisable, which inexperienced group testers tend to overlook.
Wilson, Donald Powell. My six convicts. New York: Rinehart, 1951.
A best-seller describes the experiences of a psychologist doing research on
drug addiction in a prison, using convicts as testing assistants. Of special value
are Chapters III and IV, describing how the team overcame reluctance of
convicts to take tests. See also, on p. 235, the explanation in convict language
of "the coefficient of correlation."


4
Scoring


SCORING PROCEDURES


Anyone who has tried to understand why he received a low score on
an essay test must realize how difficult it is to define a good answer
or to determine the proper credit for a partially correct response.
Starch and Elliott (1912, 1913) provided conclusive evidence on the faults
of impressionistic scoring as long ago as 1912. They presented a pupil's
English paper to a convention of teachers and asked a number of volunteers
to grade it. On a percentage scale, the grades assigned ranged from 50
to 98. This evidence of disagreement could perhaps be dismissed, since
judging a composition is influenced by preferences for various styles. To
drive home the point, however, they had a geometry paper graded in the same
way. The scores ranged from 28 to 92, presumably because of variation in
the credit given to neatness, partial solutions, etc.

No scientific research on behavior can be done nor can we hope to reach
sound practical decisions if scoring standards vary erratically. One solution
is to develop rules for judgment which all scorers will follow. The other
solution is to use recognition items where the subject is to choose the right
answer; this eliminates all judgment from scoring, once the initial key is
agreed upon by competent persons.


Scoring of Free Responses


Free-response testing continues to use problems calling for some degree of
subjective judgment in scoring, but methods can be devised which permit fairly
objective scoring of the important features of behavior. Ayres, for example,
produced a guide for scoring pupil handwriting (Figure 7). Samples of
handwriting representing various levels of quality are given; the teacher
locates the sample most similar to the pupil's work in order to determine his
score. Product-rating scales can be developed for judging quality of sewing,
shopwork, etc. Objective methods have not been completely developed for
scoring verbal tests, but variation among scorers is reduced by guides which
show the approved scoring for representative answers. Noteworthy examples
are the scoring manual for the Stanford-Binet test of intelligence (Terman
and Merrill, 1959) and the volume by Beck (1944) on the Rorschach test
of personality.


FIG. 7. Part of Ayres' scale for scoring handwriting samples. (Copyright 1912,
Russell Sage Foundation. Reproduced by permission of the present publisher,
Educational Testing Service.)


While the scoring guide is adequate for most testing of individuals, special
precautions must be taken when free-response tests are used to compare
experimental treatments. The scorer who believes or wishes to prove that one
treatment is superior may unconsciously tend to give higher scores to the
subjects who had that treatment (Goodenough, 1940). To prevent such bias
it is necessary to mix all records together before presenting them to a scorer
who does not know which group any person belongs to. This procedure is
called "blind" scoring.
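The mixing of records for "blind" scoring can be sketched as a small bookkeeping routine. This is only an illustration under assumed conventions: the record fields (`subject_id`, `group`, `response`), the code format, and the function names `blind_batches` and `unblind` are all hypothetical. The scorer receives shuffled responses labeled with neutral codes; the key linking codes to subjects and treatment groups is held back until scoring is finished.

```python
import random


def blind_batches(records, seed=None):
    """Shuffle records and strip identifying fields.

    records: list of dicts with 'subject_id', 'group', and 'response'.
    Returns (blind, key): `blind` is what the scorer sees; `key` maps each
    neutral code back to (subject_id, group) and is kept from the scorer.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    blind, key = [], {}
    for i, rec in enumerate(shuffled, start=1):
        code = f"R{i:03d}"
        key[code] = (rec["subject_id"], rec["group"])
        blind.append({"code": code, "response": rec["response"]})
    return blind, key


def unblind(scores, key):
    """Re-attach subject and group to each score once scoring is complete."""
    return [(key[code][0], key[code][1], score) for code, score in scores.items()]


if __name__ == "__main__":
    records = [
        {"subject_id": "s1", "group": "treatment", "response": "answer one"},
        {"subject_id": "s2", "group": "control", "response": "answer two"},
    ]
    blind, key = blind_batches(records, seed=0)
    # The scorer works only from `blind`; here a stand-in score is used.
    scores = {item["code"]: len(item["response"]) for item in blind}
    print(unblind(scores, key))
```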


1. The question "Why should people wash their clothing?" is to be used in an
intelligence test for adults, to test comprehension of common situations.
Prepare a set of standards for judging correctness of answers. Make your
rules so clear that scorers would be able to agree in scoring new answers.


Scoring of Recognition Items 


The scoring guide for a recognition test consists of a list of the correct
answers and a scoring key which a clerk can use. Several efficient procedures
[...]; costs are reduced because the same booklet can be used repeatedly, and
the answers can easily [...] machine. The carbon booklet consists of [...]
paper and a hidden under [...] sealed together, and the subject marks his
choices on the face-page. His marks are traced by the carbon onto the bottom
page. The scorer tears open a perforated edge of the booklet to reveal the
bottom page on which printed squares show where the correct answers should
appear. It is a simple matter to count the number of carbon marks falling
within the squares. The pin-punch method is similar to the carbon booklet.
Instead of checking his answer with a pencil, the subject sticks in a pin at
that point. Squares are printed on the back of the page so that when the
booklet is torn open the number of holes falling within squares indicates the
score.


FIG. 8. Portion of answer sheet for machine scoring. (Courtesy International
Business Machines Corporation.)

Machine Scoring. The scoring machine most widely known is that
developed by International Business Machines in the late 1930's. The subject
blackens an answer space with a soft pencil. Electrified "fingers" in the
machine sense where pencil marks appear, since the graphite in the marks
carries current. A meter shows the total number of properly placed marks.
The machine will report number of errors, rights-minus-wrongs, and other
types of scores. Under ideal conditions, it can score as many as 500 papers
per hour accurately (Traxler, 1954; Lindquist, 1951, pp. 408 ff.). Military
classification centers rely on the machines to process recruits. Large school
systems use them to score tests for the entire system. In most parts of the
country, a test-scoring service is available where tests from scattered
schools may be machine-scored.
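The kinds of score mentioned above (number right, number of errors, rights-minus-wrongs) follow directly from comparing each marked answer with a key. The sketch below shows that comparison in software form; it does not describe how the IBM machine works internally. The function name `score_sheet` and the use of `None` for an omitted item are assumptions made for illustration.

```python
def score_sheet(responses, key):
    """Score one answer sheet against a key.

    responses: list of chosen options (e.g. "A".."E"); None marks an omission.
    key:       list of correct options, one per item.
    Returns a dict with rights, wrongs, omits, and rights-minus-wrongs.
    """
    if len(responses) != len(key):
        raise ValueError("answer sheet and key must have the same length")

    rights = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrongs = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    omits = sum(1 for r in responses if r is None)

    return {
        "rights": rights,
        "wrongs": wrongs,
        "omits": omits,
        "rights_minus_wrongs": rights - wrongs,
    }


if __name__ == "__main__":
    key = ["A", "B", "C", "D", "A"]
    responses = ["A", "B", "D", None, "A"]
    print(score_sheet(responses, key))
```

Omitted items are counted separately here so that they are not treated as errors in the rights-minus-wrongs score.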


The main difficulty with the IBM machine, for test administrators, is that
it cannot score accurately unless answer spaces are neatly blackened by the
student. While the machine is supposed to deliver one unit of credit whenever
a mark appears in the right space, whether the mark is light or heavy, wide or
narrow, in practice this cannot be expected. The mark must be made with the
proper sort of soft pencil, and to be sure of being counted must fill
completely the space between the dotted lines on the answer sheet.
Furthermore, stray pencil marks and smudges due to untidy erasure register and
are counted as errors. The number of improperly blackened or untidy papers is
so great that scoring agencies have to examine each paper before feeding it to
the machine; if necessary, a clerk blackens more heavily where the student
used a faint mark to indicate his answer, and erases stray marks. This adds
appreciably to the cost of scoring.


The need for such re-marking can be almost entirely eliminated by proper
test administration and proctoring.


FIG. 9. IBM test-scoring machine. (Courtesy International Business Machines
Corporation.)

Newer electronic scoring machines are becoming available. One developed
by the University of Iowa for its [...] a photoelectric "reading" device with
[...] 1954). Responses to as many as [...] Jones, he blackens J in the first
column [...] and converted scores [...] part scores and weighted composites
can be obtained; many computations desired for research can also be carried
out at the same time.

We are beginning to see automation in testing itself, as well as in scoring.
This is particularly demonstrated in the application of "Skinner box"
techniques to human subjects. As used in the psychological laboratory the
Skinner box has taken the form of a cage in which the rat, pigeon, or other
animal activates a mechanism by making a particular response: striking a lever,
tapping a spot on the cage wall, etc. Rewards for correct performance can
be administered automatically according to any desired schedule, for example,
by dropping a food pellet into the tray at the end of one minute if the
lever has been struck during that minute. Skinner (1958) is now adapting
the same principle to the study of arithmetic performance of children. The
child responds to problems presented by the machine, pushing a button to
indicate his answer; a correct response is rewarded by a signal. The machine
puts the child through an automatic drill, and at the end delivers a record
showing the child's rate of response and his accuracy.

In a mental hospital, Skinner and Lindsley arranged a room where patients receive rewards for pulling a lever (Lindsley, 1956). The rewards include cigarettes, a brief look at an entertaining picture, or (as a social reward) the opening of a window where they can watch one of the doctors working at his desk. In this device also, an automatic recording machine traces the rate of response and thus provides a performance record which might have diagnostic significance. One of the most striking features of the method is that it provides a completely nonverbal test. The subject can be introduced into the test room with no instructions whatsoever and left to discover for himself what happens when he pulls the lever and to respond to the reward in his own manner.
2. If Skinner’s procedures yield [...] information, it will be possible to administer the “test” [...] to mental-hospital patients. Apart from questions of cost, what are the advantages and disadvantages of automatic testing, as compared with face-to-face testing by a clinical psychologist?

3. Reexamine the directions for the TMC (Chapter 3) [...] directions to make sure that students blacken answer s[...]?

4. What effects upon the character of tests and their use might be expected to follow from the availability of a machine which makes it possible to obtain virtually an unlimited number of scores from a single answer sheet, at negligible cost?


INTERPRETATION OF SCORES


Raw Scores


Most tests yield a direct numerical report of a person’s performance, called his raw score. This may be the number of questions he answered, the time he required for the task, or some similar number. Because raw scores are readily available, and familiar from long experience in classroom examinations, many people interpret them without realizing their limitations. An example from the old-fashioned report card will demonstrate the problem. Willie brings home a report showing that his average in arithmetic is 78, and his average in spelling is 90. His parents can be counted on to praise the latter and disapprove the former. Willie might quite properly protest, “But you should see what the other kids get in arithmetic. Lots of them get [...] and 65.” The parents, who know a good grade when they see one, refuse to be sidetracked by such irrelevance. But what do Willie’s grades mean? It might appear that he has mastered three-fourths of the course work in arithmetic, and nine-tenths in spelling. But Willie objects to that, too. “I learned all my combinations, but he doesn’t ask much about those. The tests are full of word problems, and we only studied them a little.” Willie evidently gets 75 percent of the questions asked, but since the questions may be easy or hard, the percentage itself is meaningless. We cannot compare Willie and his sister Sue, whose teacher in another grade gives much easier tests so that Sue brings home a proud 88 in arithmetic. It could be, too, that Willie’s shining 90 in spelling is misleading, if the spelling tests deal with the very words assigned for study.


A raw score on a psychological test can be interpreted only b[...] [...]marks (1948, p. 83) illustrate this point: “In American college circles, [...] yards in [...] seconds [...] grounds there is no occasion [...] reasoning [...] raw score of 40. The test does not include the problems everyone can solve, however; if people were tested on every possible problem calling for reasoning, the true ratio might be 140 to 180, or 1040 to 1080. Even an infant, looking toward the door when he hears his mother’s footstep, shows some traces of ability to reason. Absolute zero in any ability is ‘just no ability at all.’”

Differences in raw scores do not ordinarily represent “true” distances between individuals. Suppose, on DAT Mechanical Reasoning Form A, Adam gets 53 points, Bill gets 56, and Charles gets 59. The raw-score differences are equal. Is Charles truly as different from Bill as Bill is from Adam? We cannot be sure, since the score difference depends on the items used. Judging from the published norm tables for twelfth-graders, if these same boys took the Bennett Form AA the “equal differences” would be replaced by unequal ones: Adam would get 44 points, Bill would get 45, and Charles 48.

The only way one can meaningfully talk about “equal differences” is to bring in some practical criterion which provides a standard of value. Different standards will lead to different numerical scales for the same test. On the DAT the three boys’ raw scores are equally spaced. Their probabilities of passing a college engineering course may be .70, .90, and .96, respectively. Their most likely freshman grade averages may be D, C+, and B-. And their respective probabilities of later success in a very demanding engineering firm may be .0001, .05, and .50. “Equal intervals” on one of these scales are quite unequal on the other.

Having scored a test, the tester has four alternatives:

• He may compare the score directly to some accepted standard of performance. For example, a school may admit to the first grade all children who earn a certain predetermined score on a readiness test.

• He may compare the score to other scores in the group tested.

• He may compare the score to scores in a reference group by means of a table of norms.

• He may use an expectancy table to estimate the individual’s probable subsequent performance.

Of these methods, the most common is to compare the individual with a reference group. The tester refers to a table in the manual to learn what the normal range of performance is. More than that, he converts the raw score into some type of derived score which is a permanent record of the individual’s relative position. The most common types of derived score are percentiles and standard scores.

Many statistical methods used by the test developer are simple enough to be followed by the test user, who can prepare norm tables or expectancy tables for his own group. Expectancy tables, which we consider first, require no more than simple tabulation and calculation of percentages.
5. Decide whether an absolute zero exists for each of the following variables and, where possible, define it:
   a. Height.
   b. Ability to discriminate between the pitches of tones.
   c. Speed of tapping.
   d. Gregariousness, seeking the companionship of others.
   e. Rifle aiming.

6. If several pupils in Willie’s class move away and are replaced by newcomers, will his raw score in arithmetic probably change? His rank in his class?

7. If a different set of test questions were used in arithmetic, would Willie’s raw score change? His rank?

8. Alfred, a college freshman, is to receive guidance on his academic plans, and is given four tests of ability. Scores are presented in four different ways. Interpret separately each row of scores.

                             Vocabu-   Verbal      Nonverbal   Mechanical
                             lary      Reasoning   Reasoning   Comprehension
Raw score                    116       32          44          48
Percent of possible points   77        73          80          [...]
Points above average         24        10          20          0
Rank among 260 freshmen      104       113         161         136

9. Two runners train for the mile. One, between his junior and senior years, reduces his time from 4:16 to 4:04. The other starts with a time of 5:16. What time must he achieve for us to say that he has made as much improvement as the first runner?


Expectancy Tables


The expectancy table is a useful device for interpreting performance. The test developer or test user administers the test to a large number of persons and subsequently observes their success; these results can be tabulated to form an experience table such as Table 1. This table is based on application of a general scholastic aptitude test (the Ohio State University Psychological Examination) to 920 freshmen at Ohio State. To interpret a test score, the counselor need only direct attention to the row of the table corresponding to the score; the entries show how likely the student is to attain a particular grade average. This explanation is more definite and more complete than can be offered by any other system of norms. As Bingham says (1951, p. 552), “The counselor of an entering student who has scored in the lowest decile range (lowest tenth) on this test can now show him these expectancies, and point out, if it seems advisable, that his chances of keeping off probation (Point-Hour Ratio = 1.50) are a little better than even; that he has one chance in a hundred of earning high honors; and that in any event much depends on the persistence and strength of his own determination, a powerful factor not measured by this or any other psychological test.”


TABLE 1. Expectancy Table for First-Semester Freshman Achievement

Score on OSU                Probability of Earning a Point-Hour
Psychological Test          Ratio of at Least

Raw       Percentile   1.00      1.50          2.00                3.00
Score     Rank         (D av.)   (Probation)   (C av.)    2.50     (B av.)
[...]     90-          100       99            93         80       56
[...]     80-89        100       96            9[...]     60       30
[...]-101 70-79        100       95            90         60       29
[...]     60-69         99       90            78         41       27
[...]     50-59         98       87            74         25       13
[...]     40-49         97       80            62         25       13
[...]-65  30-39         96       79            61         17        5
[...]-55  20-29         95       75            47         13        4
[...]     10-19         95       63            33          7        2
0-39      [...]         87       58            29          3        1

Source: Bingham, 1951, based on data from G. B. Paulsen.


Expectancy data may also be presented in charts like Figure 10. The chart gives less precise interpretations than the table but is especially useful in explaining different scores to laymen. The charts illustrated are based on three different tests; it can be seen that the dexterity test is a much less accurate predictor than the other two. Besides interpreting scores for individuals, the expectancy table gives information on the validity of a test (see Chapter 5).
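Because an expectancy table requires only tabulation and percentages, its construction can be sketched in a few lines of Python. The sketch below is illustrative only: the function name `expectancy_table`, the score bands, and the eight records are invented for the example and are not the Ohio State data. For each band of test scores it counts what percent of persons later reached each criterion level.

```python
def expectancy_table(records, score_bands, thresholds):
    """Tabulate, for each score band, the percent of persons whose later
    outcome reached each threshold.

    records     -- list of (test_score, later_outcome) pairs
    score_bands -- list of (low, high) inclusive score ranges
    thresholds  -- outcome levels, e.g. point-hour ratios
    """
    rows = {}
    for low, high in score_bands:
        outcomes = [outcome for score, outcome in records if low <= score <= high]
        n = len(outcomes)
        rows[(low, high)] = {
            # None marks a band with no cases, where no percentage exists.
            t: round(100 * sum(o >= t for o in outcomes) / n) if n else None
            for t in thresholds
        }
    return rows


# Invented records: (test score, first-semester point-hour ratio).
records = [(35, 1.2), (38, 1.6), (20, 0.9), (30, 2.1),
           (95, 3.4), (90, 2.6), (99, 3.1), (85, 1.8)]
bands = [(0, 39), (80, 100)]
table = expectancy_table(records, bands, thresholds=[1.5, 3.0])

for band, row in table.items():
    print(band, row)
```

With so few cases per band the percentages are of course unstable; a table meant for counseling would rest on hundreds of cases, as Table 1 does.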


10. Expectancy tables prepared for local use are clearly meaningful. Can expectancy tables profitably be included in test manuals, in view of the fact that probability of success on a job depends on local conditions?

11. Interpret this information about scores of a prospective aircraft armorer, prior to training: mechanical aptitude, 120; trade information, 140; nut-and-bolt test, 100.


FIG. 10. Expectancy charts showing probability of earning at least an average grade in [...] an aircraft armorer (Personnel Classification Tests, 1946). (Panels: probability of success as a function of score on Mechanical Aptitude, Trade Information, and Nut-and-Bolt Manual Dexterity.)
Percentile Scores


The easiest way to make comparisons is to rank the scores from highest to lowest. Reporting that a person stands third out of forty conveniently describes his position relative to others. Ranks, however, depend on the number of persons in the group. If we wish to examine change in standing from one occasion to another we have difficulty because the size of the group changes. To avoid this difficulty, ranks are changed to percentile scores (also called percentile ranks and centile ranks). A percentile score is the rank expressed in percentage terms. A person’s percentile score tells what proportion of the group falls below him. Suppose there are 40 persons, 27 superior to A and 12 poorer. Then we arbitrarily divide case A (and all persons tied with him, if any) between the two groups, saying that 27½ cases are above him and 12½ cases below. Since 12½ is 31 percent of 40, his percentile score is 31.

By this method of computation, the person exactly in the middle of the group is at the 50th percentile. The 50th percentile is called the median. The median indicates the performance of the most typical member of the group.

A graphic procedure is often used to compute percentiles. The graphic method disregards irregularities in the distribution of scores in a particular sample and therefore gives a better estimate of what may be expected when further groups are tested. Computing Guide 1 demonstrates this method, using a set of Bennett TMC scores for a ninth-grade class.

Transforming raw scores to percentile scores changes the shape of the distribution. In Figure 11, raw scores for the same ninth-grade class have been plotted. The distribution is high at the center and tapers away at each end. When each score is changed to a percentile equivalent in the lower part of the figure, the distribution is nearly rectangular. With larger samples the percentile distribution becomes almost perfectly rectangular. Persons near the middle of the raw-score scale are spread apart; persons at the extremes are crowded together. It is important to realize that a rather large percentile difference near the median often represents a small difference in performance. Conversely, the difference between the 90th and the 99th percentile, though it looks small on this scale, may be as great as the difference between a five-minute and a four-minute mile.

Averaging two percentile scores gives a result different from what would be obtained if the average of the corresponding raw scores were changed to a percentile. [...] scores of 14 and 22 average 18, which [...] averaging percentiles, our answer is [...] Although one rarely makes a huge error by averaging percentiles, the cumulation of such errors can distort statistical [...]; therefore percentiles should not [...]

Percentile scores [...] norms for tests may be based on dissimilar groups. The Purdue Pegboard norms are based on industrial trainees; [...] gives norms for engineering freshmen. A person at the 75th percentile on both tests according to these norms does not have equally [...] in these two abilities. For the TMC, separate tables for in[...] are available, and we find that a score which is at the 73rd percentile among freshmen is at the 90th percentile among trainees. Wherever percentiles are used, the norm group involved must be kept in mind.


FIG. 11. (Axes: Frequency against Raw Score.)


COMPUTING GUIDE 1. DETERMINING PERCENTILE EQUIVALENTS

1. Begin with the raw scores (these are scores of 75 ninth-grade boys on Bennett Form AA). Rows ending in "..." are cut off in the source.

   37 43 27 44 27 27 26 31 35 ...
   35 43 36 26 50 47 36 26 32 32 38
   36 21 24 40 39 35 38 36 38 ...
   26 35 22 18 50 30 38 50 16 ...
   34 26 34 28 41 27 39 41 30 23 33
   22 31 36 40 54 24 22  8 33 ...
   41 31 34 36 32 20 22 34 41 ...

2. Identify the highest score and the lowest score. If there is a wide range, choose a class interval of 1, 2, 5, 10, 20, etc., and divide the range into classes of equal width. Fifteen or more classes are desirable.

   Highest score = 54; lowest score = 8; range = 46. Class interval of 5 will be used. (A smaller interval, such as 2, would be preferable but would be inconvenient in this computing guide.)

3. Tally the number of cases with each score.

4. Write the number of tallies in the Frequency (f) column. Add this column to get N, the number of cases.

5. Begin at the bottom of the column and add frequencies one at a time to determine the cumulative frequency, the number of cases below each division point.

6. Divide the cumulative frequencies by N to determine cumulative percentages.

                Frequency   Cumulative    Cumulative
   Scores       (f)         Frequency     Percentage
   50-54         5          75            100
   45-49         2          70             93
   40-44        12          68             91
   35-39        17          56             75
   30-34        14          39             52
   25-29        10          25             33
   20-24        10          15 *           20 **
   15-19         3           5              7
   10-14         0           2              3
   5-9           2           2              3
   0-4           0           0              0
                N = 75

   * 15 cases fall below 24.5; etc.
   ** 20 percent of the cases fall below 24.5; 20 is the cumulative percentage corresponding to a raw score of 24.5.

7. Plot cumulative percentage against score. (In practice, a large sheet of graph paper would be used.)

8. Draw the smooth curve which best fits the points plotted.

9. Determine the percentile equivalent of each raw score from the curve. (On the chart one finds that the percentile equivalent of a raw score of 40 is 74.)
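The arithmetic behind percentile scores can be sketched in Python. The first function applies the rank definition given earlier in this section (cases tied with the person are split half above and half below); the second carries out the cumulative steps of Computing Guide 1. The function names are hypothetical, and the class frequencies are an assumption: they are the values implied by the cumulative frequencies in the guide (N = 75). The graphic smoothing step is not attempted here.

```python
def percentile_score(n_below, n_tied, n_total):
    """Percent of the group below a person, counting half of the cases
    tied with him (himself included) as below."""
    return 100 * (n_below + n_tied / 2) / n_total


def cumulative_percentages(classes, width):
    """classes: (lower score limit, frequency) pairs, lowest class first.
    Returns (upper division point, cumulative frequency, cumulative percent)."""
    n = sum(f for _, f in classes)
    running = 0
    rows = []
    for lower, f in classes:
        running += f
        # The division point above the class 20-24 is 24.5, and so on.
        rows.append((lower + width - 0.5, running, 100 * running / n))
    return rows


# 40 persons: 27 above A, 12 below, and A himself.
score_a = percentile_score(n_below=12, n_tied=1, n_total=40)

# Assumed frequencies for the ninth-grade class, class interval of 5.
classes = [(0, 0), (5, 2), (10, 0), (15, 3), (20, 10), (25, 10),
           (30, 14), (35, 17), (40, 12), (45, 2), (50, 5)]
rows = cumulative_percentages(classes, width=5)

print(round(score_a))          # percentile score of person A
for point, cum_f, cum_pct in rows:
    print(point, cum_f, round(cum_pct))
```

Plotting the third column against the first and drawing a smooth curve through the points would complete the graphic method.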


TABLE 2. Bennett TMC Norms for Boys in Grade 9

Percentile   Raw Score
99           54
95           47
90           44
85           41
80           39
75           38
70           36
65           35
60           33
55           32
50           31
45           30
40           29
35           27
30           26
25           23
20           22
15           20
10           17
5            14
1            5

Number of cases   833
Mean              30.8
s.d.              10.4

Source: Bennett, 1947.


In the manual for the Bennett TMC, the user finds a collection of percentile conversion tables permitting him to compare his subject with various reference groups. One such table is reproduced here. This can be used like the table prepared in Computing Guide 1, although the tables are arranged differently.

A score of 39 falls at the 80th percentile in the norm table, even though it is nearer the median (68th percentile) of the small class (Computing Guide 1). The median of the small class (34) is higher than the median of the standardization group; the class is evidently a superior group. This demonstrates the value of carefully collected norms. A person who is just average in mechanical comprehension would not be especially encouraged to enter a mechanical field. The average student within this class, however, appears to be superior (63rd percentile), compared to students generally. It is the larger group with whom he will compete after high school.
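Reading a percentile from a norm table such as Table 2 can be sketched as a simple lookup. The values below are copied from Table 2; the function name `percentile_from_norms` and the use of straight-line interpolation between tabled points are assumptions made for the illustration, not a procedure given in the test manual.

```python
# (percentile, raw score) pairs from Table 2, lowest first.
NORMS = [(1, 5), (5, 14), (10, 17), (15, 20), (20, 22), (25, 23), (30, 26),
         (35, 27), (40, 29), (45, 30), (50, 31), (55, 32), (60, 33), (65, 35),
         (70, 36), (75, 38), (80, 39), (85, 41), (90, 44), (95, 47), (99, 54)]


def percentile_from_norms(raw, norms=NORMS):
    """Percentile equivalent of a raw score, interpolating linearly
    between neighboring entries of the norm table."""
    if raw <= norms[0][1]:
        return norms[0][0]      # at or below the lowest tabled score
    if raw >= norms[-1][1]:
        return norms[-1][0]     # at or above the highest tabled score
    for (p_low, r_low), (p_high, r_high) in zip(norms, norms[1:]):
        if r_low <= raw <= r_high:
            return p_low + (p_high - p_low) * (raw - r_low) / (r_high - r_low)


print(percentile_from_norms(39))   # the score discussed in the text
print(percentile_from_norms(34))   # the median of the small class
```

The value for a raw score of 34 falls between the tabled 60th and 65th percentiles, which agrees with the 63rd percentile cited in the text.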


12. Interpret the following record of ability test scores for one person; all scores are percentile scores based on a random sample of adults: Verbal, [...]; Number, 46; Spatial, 87; Reasoning, 40.

13. Estimate Alfred’s percentile score in each of the four tests he took (question 8, p. 72).

14. Why does the table of Bennett norms begin at 1 and stop at 99, instead of ranging from 0 to 100? What percentile score corresponds to a raw score of 60 (perfect)?

15. Scores usually change when a test is repeated, because of chance errors of measurement. If each of the following persons changes two points up or down in raw score on the TMC, how much would his percentile score change?
    a. A person with a percentile score of 55 on the first test.
    b. A person at the 95th percentile on the first test.

16. The scores below are the times, in seconds, required by a group of persons to perform an easy Block Design problem. Prepare a table of percentile equivalents for this group:

    52 34 41 42 46 45 27 48 35 35 38 29 54 36 7
    48 39 44 36 36 34 51 40 30 33 37 41 56 32 3
    37 28 28 45 31 39 31 27 35 36 34 42 38 33
    39 28 36 33 37 36 34 54 34 32 33 38

17. According to the table prepared in problem 16, how much difference in seconds does a difference of 10 percentile points represent?
Standard Scores


Mean and Standard Deviation. The second common way to summarize the performance of a group is to use the mean and standard deviation. The mean (M) is the arithmetical average obtained when we add all scores and divide by the number of scores. The standard deviation (s.d., or σ) is a measure of the spread of scores. The variation of two sets of scores may be different even though the averages are the same. Figure 12 shows the smoothed distribution of scores of two classes taking the same test. Although the groups are similar in mean ability, the distributions are not alike. Group B contains far more very superior and inferior cases and therefore has a larger standard deviation.

One method of computing the mean and standard deviation is outlined in Computing Guide 2. The complicated formula makes it hard to see what the standard deviation means, but in effect it is an average of the deviations of persons’ scores from the group mean. We might measure the spread of scores by finding how far each person is from the mean and averaging (ignoring the direction of deviation). For mathematical reasons, the standard deviation formula takes the average of the squares of the deviations rather than of the deviations directly, and then takes the square root of the result.

The standard deviation indicates how much variation there is within a group. In much statistical analysis the square of the standard deviation, called the variance of the distribution, is used as an index of variation.


FIG. 12. Distributions of scores of two classes on the same test. (Panels: Scores of Group A; Scores of Group B.)


COMPUTING GUIDE 2.

1. Begin with the raw scores (these are the scores of 75 ninth-grade boys on Bennett Form AA used in Computing Guide 1).

2. Identify the highest score and the lowest score. If there is a wide range, choose a class interval of 1, 2, 5, 10, 20, etc., and divide the range into classes of equal width. Fifteen or more classes are desirable.

   Highest score = 54; lowest score = 8; range = 46. Class interval of 5 will be used. (A smaller interval, such as 2, would be preferable but would be inconvenient in this computing guide.)

3. Tally the number of cases with each score.

4. Write the number of tallies in the Frequency (f) column. Add this column to get N, the number of cases.

5. Select any interval, usually near the middle of the distribution. Call this the arbitrary origin. (Here, the 30-34 interval is used.) Determine the deviation d of each interval from the arbitrary origin.

6. Multiply in each row the entries in the f and d columns, and enter in the fd column.

7. Multiply the entries in the d and fd columns, and enter in the fd² column. Add the fd and fd² columns. (Σ is a symbol meaning “sum of.”)

                Frequency
   Scores       (f)          d       fd      fd²
   50-54         5           4       20      80
   45-49         2           3        6      18
   40-44        12           2       24      48
   35-39        17           1       17      17
   30-34        14           0        0       0
   25-29        10          -1      -10      10
   20-24        10          -2      -20      40
   15-19         3          -3       -9      27
   10-14         0          -4        0       0
   5-9           2          -5      -10      50
                N = 75            Σfd = 18   Σfd² = 290

8. Substitute in the following formulas:

   \[ c \text{ (correction)} = \frac{\sum fd}{N}, \qquad M \text{ (mean)} = \text{A.O.} + i \times c, \qquad \text{s.d.} = i \sqrt{\frac{\sum fd^{2} - N c^{2}}{N - 1}} \]

   A.O. is the midpoint of the score-interval selected as arbitrary origin, and i is the width of the interval.

   \[ c = \frac{18}{75} = .24 \]
   \[ M = 32.0 + 5(0.24) = 32.0 + 1.20 = 33.2 \]
   \[ \text{s.d.} = 5 \sqrt{\frac{290 - 75(0.24)^{2}}{74}} = 5 \sqrt{\frac{290 - 75(.058)}{74}} = 5 \sqrt{3.86} = 5(1.96) = 9.80 \]
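The arbitrary-origin method of Computing Guide 2 can be sketched in Python as a check on the hand computation. The function name `grouped_mean_sd` is hypothetical, and the frequencies are an assumption: they are the class frequencies consistent with the guide's totals (N = 75, Σfd = 18, Σfd² = 290). The divisor N - 1 follows the guide's use of 74.

```python
import math


def grouped_mean_sd(freq_by_midpoint, origin, width):
    """Mean and s.d. from a grouped frequency table.

    freq_by_midpoint -- {class midpoint: frequency}
    origin           -- midpoint of the class taken as arbitrary origin (A.O.)
    width            -- class interval i
    """
    n = sum(freq_by_midpoint.values())
    sum_fd = 0
    sum_fd2 = 0
    for midpoint, f in freq_by_midpoint.items():
        d = (midpoint - origin) // width   # deviation in class-interval steps
        sum_fd += f * d
        sum_fd2 += f * d * d
    c = sum_fd / n                          # correction
    mean = origin + width * c
    sd = width * math.sqrt((sum_fd2 - n * c * c) / (n - 1))
    return n, sum_fd, sum_fd2, mean, sd


# Midpoints of the classes 5-9, 10-14, ..., 50-54 with assumed frequencies.
freqs = {7: 2, 12: 0, 17: 3, 22: 10, 27: 10, 32: 14, 37: 17, 42: 12, 47: 2, 52: 5}
n, sum_fd, sum_fd2, mean, sd = grouped_mean_sd(freqs, origin=32, width=5)

print(n, sum_fd, sum_fd2)
print(round(mean, 1), round(sd, 2))
```

Carried without intermediate rounding, the standard deviation comes to about 9.82; the guide's 9.80 reflects rounding the square root to 1.96 before multiplying by 5.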


18. a. Compute the mean and standard deviation for the Block Design scores given in problem 16, p. 78.
    b. How does the mean compare with the median computed previously?
    c. What is the approximate percentile rank for a score 2 s.d. above the mean in this distribution?


Conversion Scales. We can replace the person’s raw score with a derived score showing his position relative to the mean. To say how far above or below he is, we use the standard deviation as a unit. We can say that one person is 2.5 s.d. above the mean and another is at -1 s.d. (i.e., is 1 s.d. below the mean). From Computing Guide 2, we see that a score of 43 is about 1 s.d. above the mean, for example. Derived scores based on standard deviation units are called standard scores.

Computing Guide 3 shows how to convert raw scores to a standard-score scale with a mean of zero and each s.d. above the mean counted as one unit. One can also convert scores to the “T-score” system which sets the mean at 50 (to avoid negative scores) and each s.d. equal to 10 points. As Figure 13 shows, changing raw scores into standard scores does not alter the form of the distribution (except for slight changes due to regrouping).

Whereas the Bennett TMC presents norms in percentile form, the Wechsler Block Design norms are in standard-score form (called “scaled scores” by Wechsler). As an example, Table 3 gives the norms for persons aged 20-24. The range of converted scores is from 0 to 19, because Wechsler chose to set the mean equal to a standard score of 10, and to count each s.d. above or below the mean as 3 standard-score points.


FIG. 13. Distributions of raw scores and standard scores for the same group. (Axes: Frequency against Raw Score and T Score.)


COMPUTING GUIDE 3. CALCULATION OF STANDARD SCORES

1. Begin with the raw scores to be converted and find the mean and s.d. as in Computing Guide 2.

   For the data in Computing Guide 2, M = 33.2, s.d. = 9.80.

2. To obtain z scores, express each raw score as a deviation from the mean. Divide by the s.d.

   \[ z \text{ score} = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}} \]

   For raw score 50: \( z = \frac{50 - 33.2}{9.80} = 1.7 \)
   For raw score 25: \( z = \frac{25 - 33.2}{9.80} = -.8 \)

3. To obtain T scores, multiply the z score by 10 and add to 50.

   \[ T \text{ score} = 50 + \frac{10\,(\text{raw score} - \text{mean})}{\text{standard deviation}} \]

   For raw score 50, z = 1.7; T = 50 + 10(1.7) = 67
   For raw score 25, z = -.8; T = 50 + 10(-.8) = 42
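The two conversions of Computing Guide 3 amount to one line of arithmetic each. A minimal Python sketch, using the mean and s.d. from Computing Guide 2 (the function names `z_score` and `t_score` are chosen here for illustration):

```python
def z_score(raw, mean, sd):
    """Deviation from the mean, expressed in standard deviation units."""
    return (raw - mean) / sd


def t_score(raw, mean, sd):
    """Standard score on a scale with mean 50 and s.d. 10."""
    return 50 + 10 * z_score(raw, mean, sd)


MEAN, SD = 33.2, 9.80   # from Computing Guide 2

for raw in (50, 25):
    print(raw, round(z_score(raw, MEAN, SD), 1), round(t_score(raw, MEAN, SD)))
```

Note that the T score is computed from the unrounded z score; rounding z to one decimal first, as in the hand computation, gives the same whole-number T scores for these two cases.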
TABLE 3. Standard-Score Equivalents of Raw Scores for the Block Design Test, Ages 20-24

Scaled score   0      1      2      3      4      5      6      7      8      9
Raw score      0      1      2      3-8    9-12   13-16  17-20  21-24  25-28  29-31

Scaled score   10     11     12     13     14     15     16     17     18     19
Raw score      32-34  35-37  38-40  41-43  44-45  46-47  -      48     -      -

Source: Wechsler, 1955, p. 103.


One can develop standard scores using other values for the mean and s.d. Table 4 summarizes several standard-score systems now in use. While there have been logical reasons for many of the variations, only confusion results from so large a variety. It is now recommended (Technical Recommendations, 1954) that test developers use the T-score system, with mean 50 and s.d. 10. If it is desirable to keep converted scores below 10 so that they will fit into one column of a standard punchcard for statistical operations, the stanine scale (pronounced stay-nine) is recommended. The z conversion is used in statistical and theoretical work, but is not often used by test interpreters. The remaining systems may be expected to die out in time.
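All the systems summarized in Table 4 are the same linear conversion with different constants, which a short Python sketch makes plain. The dictionary keys and function names below are labels invented for the illustration; the means and standard deviations are those listed in Table 4. The sketch does not round or truncate, although stanines in practice are whole numbers from 1 to 9.

```python
# (mean, s.d.) assigned by each standard-score system in Table 4.
SYSTEMS = {
    "z": (0, 1),
    "stanine": (5, 2),
    "wechsler_subtest": (10, 3),
    "T": (50, 10),
    "deviation_iq_15": (100, 15),
    "deviation_iq_16": (100, 16),
    "employment_service": (100, 20),
}


def from_z(z, system):
    """Standard score in the named system for a given z score."""
    mean, sd = SYSTEMS[system]
    return mean + sd * z


def convert(score, source, target):
    """Re-express a standard score from one system in another."""
    mean, sd = SYSTEMS[source]
    return from_z((score - mean) / sd, target)


# Reproduce the two score columns of Table 4.
for name in SYSTEMS:
    print(name, from_z(1, name), from_z(-2, name))
```

For example, `convert(116, "deviation_iq_16", "T")` re-expresses a score one s.d. above the mean on the T scale.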


19. The 1959 Stanford-Binet test fixes the mean IQ at 100 and the standard deviation at 16. Express in T-score form the following IQs: 100, 84, 132, 150.

20. In Computing Guide 2, what standard score corresponds to a raw score of 40? 48? 4?

21. Draw a figure to show the relation between raw scores and T scores in Computing Guide 3.

Smoothed Score Distributions. The frequency distribution shown at the top of Figures 11 and 13 is jagged, but if more cases were added and smaller class intervals were used it would become relatively smooth. We can estimate the most likely shape of that distribution by drawing a smooth curve as shown in the top portion of Figure 14. This distribution is not perfectly symmetric, but it tails off on both sides. Most tests yield distributions of this general character. Since every distribution has its own shape, there is some advantage in converting the score scale so that every test has the same distribution form. The normal probability curve is used for this.


TABLE 4. Standard-Score Systems

Mean    s.d.       Standard Score   Standard Score
Set     Set        Corresponding    Corresponding
Equal   Equal      to 1 s.d.        to 2 s.d.
to      to         Above Mean       Below Mean       Name of System, Remarks
0       1          1                -2               z scores, prominent in mathematical theory of testing
5       2          7                1                Stanine scores
10      3          13               4                Used for Wechsler subtests
50      10         60               30               T scores; most widely used system
100     15 or 16   115 or 116       70 or 68         Deviation IQ used by many mental tests
100     20         120              60               Used for aptitude tests of U.S. Employment Service


FIG. 14. Smoothed distribution of raw scores and distribution of normalized scores. (Axes: Frequency against Raw Score, Normalized z Score, and Normalized T Score.)


The Normal Probability Curve. The normal curve (Figure 15) is a smooth, symmetric frequency curve which has important mathematical properties. The standard deviation is the distance from the mean to the “point of inflection” on the shoulder of the normal curve. This, as shown in Figure 15, is the point which separates the convex, hill-like portion from the concave tail. The normal curve [...] of probability and is used in statistical analysis to determine whether a particular experimental result may be a chance occurrence.


FIG. 15. Percentage of cases falling in each portion of a normal distribution.

Many biological measures such as heights of American men fall into a nearly normal distribution, perhaps because chance combinations of chromosomes determine the variable. In psychological tests also, it is very common to obtain normal distributions of scores. Early investigators thought it a natural law that abilities are normally distributed. It is now realized that such a statement is meaningless, since the shape of the distribution depends on the scale of measurement. The distributions of actual test scores depend on the way the test is constructed. By selecting items suitably, we can change the score distributions to U-shaped, flat, skewed to one side, etc. (F. M. Lord, 1952; Cronbach and Warrington, 1952). The use of normal curves in test scaling is therefore merely a convenience and is not based on any “normal distribution of behavior” in nature.

If we slice a normal distribution into bands one standard deviation wide, a fixed percentage of the cases always falls in each band. As Figure 15 shows, 34 percent of the cases fall between the mean and +1 s.d. In the next interval are 14 percent, and in the third 2 percent. Since 99.6 percent of the cases fall between +3 s.d. and -3 s.d., the whole range of test scores is somewhere near 6 standard deviations (less, when the group is small). These facts are handy for interpreting standard scores and for roughly reconstructing the score distribution if the mean and s.d. are known.

Whenever we have, or assume, a normal distribution, we can quickly convert standard scores to percentile scores, and vice versa. Below the mean (z score of zero, or T score of 50) are 50 percent of the cases. Below +1 s.d. are 50 + 34 or 84 percent of the cases; hence a T score of 60 equals a percentile score of 84.
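The conversions described above can be checked with Python's standard library, whose `statistics.NormalDist` class supplies the cumulative normal curve and its inverse. The helper names below are hypothetical, and the sketch assumes the scores really are normally distributed.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, s.d. 1


def percentile_from_z(z):
    """Percent of cases falling below a given z score."""
    return 100 * STANDARD_NORMAL.cdf(z)


def percentile_from_t(t):
    """Percent of cases falling below a given T score (mean 50, s.d. 10)."""
    return percentile_from_z((t - 50) / 10)


def t_from_percentile(p):
    """T score below which p percent of cases fall (0 < p < 100)."""
    return 50 + 10 * STANDARD_NORMAL.inv_cdf(p / 100)


def band_percent(z_low, z_high):
    """Percent of cases falling between two z scores."""
    return percentile_from_z(z_high) - percentile_from_z(z_low)


print(round(percentile_from_t(60)))                       # T score of 60
print([round(band_percent(k, k + 1)) for k in range(3)])  # bands above the mean
print(round(t_from_percentile(84)))                       # back to a T score
```

The three bands above the mean round to the 34, 14, and 2 percent cited in the text.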


22. What percentile rank corresponds to a score of 2 s.d. above the mean? To a score 1 s.d. below the mean?

23. In a normal distribution, what is the relation of the mean and median?

24. Assuming that scores are normally distributed on a test where the mean is 60 and the s.d. is 8, interpret the following scores: Sara, 64; Harriet, 68; Charles, [...]; [...], 48.

25. Using Figure 15, interpret each of the following in percentile terms: a z score of 3.0; a z score of -2.0; a T score of 40; a T score of 65.


Normalized Scores. Scores are somewhat easier to interpret if they
are reduced to a scale having a known distribution. For this purpose testers
most commonly employ normalized standard scores. These scores are
obtained by stretching a distribution to make it nearly normal and then
changing it to standard-score form. One procedure which accomplishes this
is to compute percentiles by the method of Computing Guide 1, and then
read from Table 5 the standard score corresponding to each percentile value.
In our illustrative distribution of ninth-grade TMC scores, 95 is the percentile
equivalent of a score of 50, and Table 5 indicates that the corresponding
normalized T score is 66. This compares with a T score of 67 (not
normalized) obtained in Computing Guide 3. Such small changes, stretching
out the scale at the upper end and compressing it at the lower end, produce
the distribution shown at the bottom of Figure 14. This distribution is more
symmetric than the raw-score distribution. If more cases had been used, the
smoothed distribution would be completely normal.

TABLE 5. Relations Between Standard Scores and Percentile Scores, When Raw
Scores Are Normally Distributed

Distance                     Percent of Cases                             Distance
from Mean          Percentile    in "Tail"        Percentile              from Mean
(z Score)  T Score   Score       of Curve           Score      T Score    (z Score)
   3.0       80      99.9          0.1               0.1         20         -3.0
   2.9       79      99.8          0.2               0.2         21         -2.9
   2.8       78      99.7          0.3               0.3         22         -2.8
   2.7       77      99.6          0.4               0.4         23         -2.7
   2.6       76      99.5          0.5               0.5         24         -2.6
   2.5       75      99.4          0.6               0.6         25         -2.5
   2.4       74      99.2          0.8               0.8         26         -2.4
   2.3       73      99            1                 1           27         -2.3
   2.2       72      99            1                 1           28         -2.2
   2.1       71      98            2                 2           29         -2.1
   2.0       70      98            2                 2           30         -2.0
   1.9       69      97            3                 3           31         -1.9
   1.8       68      96            4                 4           32         -1.8
   1.7       67      96            4                 4           33         -1.7
   1.6       66      95            5                 5           34         -1.6
   1.5       65      93            7                 7           35         -1.5
   1.4       64      92            8                 8           36         -1.4
   1.3       63      90           10                10           37         -1.3
   1.2       62      88           12                12           38         -1.2
   1.1       61      86           14                14           39         -1.1
   1.0       60      84           16                16           40         -1.0
   0.9       59      82           18                18           41         -0.9
   0.8       58      79           21                21           42         -0.8
   0.7       57      76           24                24           43         -0.7
   0.6       56      73           27                27           44         -0.6
   0.5       55      69           31                31           45         -0.5
   0.4       54      66           34                34           46         -0.4
   0.3       53      62           38                38           47         -0.3
   0.2       52      58           42                42           48         -0.2
   0.1       51      54           46                46           49         -0.1
   0.0       50      50           50                50           50          0.0
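
The normalizing procedure can be sketched in Python as follows. This is a minimal illustration, not the method of the Computing Guides: it assumes the common midpoint convention for percentile ranks (cases below the score plus half the cases at the score), the raw scores are invented, and the names `percentile_ranks` and `normalized_t` are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def percentile_ranks(scores):
    """Midpoint percentile rank of each distinct raw score (an assumed convention)."""
    counts = Counter(scores)
    n = len(scores)
    ranks = {}
    below = 0
    for score in sorted(counts):
        ranks[score] = 100 * (below + counts[score] / 2) / n
        below += counts[score]
    return ranks


def normalized_t(percentile):
    """Normalized T score: the T score a normal distribution gives this percentile."""
    z = NormalDist().inv_cdf(percentile / 100)
    return round(50 + 10 * z)


# Hypothetical, positively skewed raw scores used only for illustration.
raw_scores = [12, 15, 15, 18, 20, 20, 20, 25, 31, 44]
conversion = {score: normalized_t(rank)
              for score, rank in percentile_ranks(raw_scores).items()}

if __name__ == "__main__":
    for score, t in conversion.items():
        print(score, t)
```

In this sketch a percentile of 95 converts to a normalized T score of 66, the same correspondence that Table 5 gives. The long upper tail of the invented raw scores is pulled in by the conversion, which is the stretching and compressing the text describes.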
Profiles


With derived scores, it is possible to compare performance on one test
with that on another. This is illustrated in the Differential Aptitude Tests, a
set of eight tests for different abilities. One section is a new form of the
Bennett TMC. The tests vary in length and in difficulty, so that from the raw
scores alone one cannot judge the person's greatest ability. After raw scores
are changed to percentiles or normalized standard scores, one can plot a
profile showing his relative standing in all fields. The profile shown in Figure
16 is that of a high-school junior; the appropriate norms were used to
convert his scores. Robert is highly superior in the various reasoning tests, and
almost equally outstanding in all of them. His last three scores are quite . . .

[FIG. 16. Profile of Robert Finchley on the Differential Aptitude Tests (Bennett, Seashore, et al.). Scales shown: Verbal, Numerical, Abstract, Space, Mechanical, Clerical, Spelling, Sentences.]


Comparison of Systems

Since the manual sometimes offers more than one type of conversion,
and since the user often has to develop local norms for tests, he needs a
basis for deciding which system of scores is preferable.

The percentile score has these advantages: it is readily understood, which
makes it especially satisfactory for reporting data to persons without
statistical training; it is easily computed; it may be interpreted exactly even
when the distribution of test scores is nonnormal. The disadvantages of the
percentile score are these: it magnifies small differences in score near the
mean which may not be significant, and it reduces the apparent size of large
differences in score near the tails of the distributions; it may not be used in
many statistical computations.

The advantages of standard scores are as follows: Differences in standard
scores are proportional to differences in raw score; use of standard scores in
averages and correlations gives the same result as would come from use of
raw scores. The disadvantages are that standard scores cannot be
interpreted readily when distributions are skewed, and that untrained persons
generally find them difficult to understand. In general, statisticians prefer
standard scores while those who interpret tests directly to laymen prefer
percentiles.

Normalized standard scores have become increasingly popular and are
generally suitable. The normalized scores spread out cases at the extremes of
the distribution and yet can readily be translated into percentiles. The
form shown in Figure 16 illustrates typical current practice. One can
read off standard scores (normalized) when they are needed for statistical
comparisons but can talk to the subject in terms of percentiles.
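
The first disadvantage of percentiles named above can be checked numerically. The sketch below assumes a normal distribution and compares the percentile difference produced by the same half-s.d. difference in score at two places on the scale; the helper names `percentile` and `percentile_gap` are hypothetical.

```python
from statistics import NormalDist


def percentile(z):
    """Percent of cases below a z score, assuming a normal distribution."""
    return 100 * NormalDist().cdf(z)


def percentile_gap(z_low, z_high):
    """Difference in percentile rank between two z scores."""
    return percentile(z_high) - percentile(z_low)


# The same half-s.d. difference in standard score, at two places on the scale.
near_mean = percentile_gap(0.0, 0.5)   # just above the mean
in_tail = percentile_gap(2.0, 2.5)     # far out in the upper tail

if __name__ == "__main__":
    print(f"near the mean: {near_mean:.1f} percentile points")
    print(f"in the tail:   {in_tail:.1f} percentile points")
```

Under these assumptions the half-s.d. step near the mean is worth roughly nineteen percentile points, while the same step in the tail is worth fewer than two, which is why percentile differences cannot be treated as equal units.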


26. A teacher wishes to convert scores on class examinations so that he can tell at
a glance how well a person is doing and can average all tests equally in the
final grade. Should he use raw, percentile, or standard scores?

27. In a certain college, all freshmen are given a "scholastic aptitude test." The
results are to be mimeographed and confidential copies given to all professors.
Should the report use raw scores, standard scores, or percentiles?

28. The psychometrist gives a wide variety of tests to veterans needing counseling.
After each man has taken from four to eight tests, results are to be placed on a
standard report form so that performance on all tests can be compared by the
counselor, in conference with the veteran. What problems will be encountered
if all results are reported in percentiles? If all results are reported as standard
scores?


NORMS

The test manual assists the user to interpret scores by presenting information
regarding "normal" performance. This information takes the form of one or
more tables. The user should have no difficulty in interpreting the information
provided in the manual although every manual organizes its tables a
bit differently. For example, the Bennett test manual provides tables of
percentile equivalents (cf. Table 2) so that the user may compare an individual
with any one of the following groups: 833 ninth-grade boys, 870 tenth-grade
boys, 613 engineering-school freshmen, 1836 candidates for policeman and
fireman positions, 548 candidates for apprentice training, 145 candidates for
engineering positions, 1637 workers in a paper factory, 226 trainees in an
airplane factory, and fifteen other groups. The DAT form has separate norms
for boys and girls, for each grade from 8 through 12. This battery is primarily
used in high-school guidance.

No detailed information about the Bennett norm groups is given; hence a 
user of the test can only guess whether his situation resembles that of the 
tenth grade or of the "engineering positions" in the table. These norms were 
published in 1940. They may be contrasted with the more modern descrip- 
tion given in the manual for the DAT Mechanical Reasoning Test (Bennett 
et al., 1959): 


Over one hundred school systems from all major geographic areas
contributed to the normative sampling. In some cities the whole school
population in five grades (eight through twelve) was tested; in some,
all pupils in one to four grades were tested; in some larger cities, pupils
in representative schools (as judged by the local research director)
were examined. A complete listing of the normative sample showing
the number of students in each grade in each community is
[obtainable]. . . . The total number of students included in the present norms
(1952) is over 47,000.

The states which contributed to the normative study . . . and the
number of communities in each were: Arizona, 1; California, 5;
Colorado, 1; . . . West Virginia, 2. The testing of the normative sample
was done throughout the school year. It is appropriate to assume that
the norms represent mid-year performance.


Some testers attach too much importance to norms, either when they select
tests or when they interpret scores. Others, recognizing that norms are
helpful, are unduly impressed by the number of cases used in compiling the
norm tables. We shall see, however, that the size of the standardizing sample
alone does not indicate how satisfactory the norms are.

Norms are unimportant in many uses of tests, particularly when one
intends only to identify individual differences within a group. For example,
norms are of little use to the employment manager who wishes to hire the
brightest ten persons in a group of applicants. Norms are also of little value
where a critical score is used. If a personnel manager knows from actual
experience that persons with scores of 72 on test A make satisfactory punch-press
operators, it is not necessary for him to compare applicants with national norms.

In guidance and clinical work, it is extremely important to use norms in
interpreting scores. The person's position relative to his group has to be
fixed as definitely as possible. A child who scores at the 20th percentile in
a test of readiness for first grade will have difficulties in school, but he is by
no means a rare exception. If our norms placed him at the 2nd percentile
instead, we would not expect him to fit into the regular program at all.

For many test interpretations, local norms are far more important than 
large-group norms. Classes differ so much that a child who is at the 20th per- 
centile in the typical first grade would be at the 2nd percentile in some 
other school which enrolls pupils from a select neighborhood. One example
of such school-to-school differences is given in Table 6.


TABLE 6. School-to-School Differences in Mechanical Reasoning Scores of
Ninth-Grade Boys

                                           Approximate
City                 Name of School        Number of Cases    Mean     s.d.
Worcester, Mass.     Commerce              . . .              . . .    12.7
Worcester, Mass.     Five other schools    . . .              . . .    . . .
St. Joseph, Mo.      Benton                70                 . . .    . . .
St. Joseph, Mo.      Lafayette             70                 . . .    10.7
St. Paul, Minn.      . . .                 50                 . . .    . . .
Independence, Mo.    Chrisman              175                37.6     12.4

Source: Bennett et al., 1947.
Norms, to be useful, must permit the tester to compare the subject with
his prospective companions and competitors. The manual for the Wechsler
intelligence scale gives norms based on adults in general. But a boy who is
above the average for his age, compared to people in general, may be below
average among college freshmen. If we wish to predict whether he can
succeed in college, we need Wechsler norms based on college students alone.
More than that, we need to know the norms for the particular college he
plans to attend.

Sections of the country, occupational groups, and schools vary widely in
America. One example of the geographical differences to be expected is
given in Table 7. The SSCQT was given for several years to young men
who might wish to attend college; those earning scores of 70 and over were


TABLE 7. Geographical Differences in Selective Service College Qualification
Test Scores

                        Percent of Registrants Scoring Below
Region                  . . .    . . .     70     75     80
New England               1        4       44     71     92
Middle Atlantic           1        3       40     69     92
East North Central        2      . . .     45     74     94
West North Central        1        5       44     73     93
South Atlantic          . . .     11       57     78     94
East South Central        7       15       66     85     97
West South Central      . . .     16       63     84     95
Mountain                  2        4       46     75     94
Pacific                   2        5       45     72     93

Source: Statistical Studies, 1955, p. 89.


generally allowed to postpone their military service until completion of
college. The test, designed to be as fair a measure of scholastic aptitude as
possible, called for verbal and quantitative reasoning. There are large
differences among regions: in the Midwest and East the average registrant was at
a high enough level to be exempt from immediate draft, whereas in the
South only 40 percent of the registrants performed at this level.

Whenever he can, the test interpreter should prepare norms for the groups
with which he deals directly. A high-school counselor could profitably use
information about the score distribution for all boys in his high school, for
boys in the shop curriculum, for boys who later attend the local college, and
for workers in certain large local industries. He uses published norms
because it takes time and effort to accumulate local norms or because, as is
often the case, he cannot possibly accumulate local norms. A clinician, for
example, has no chance to prepare norms for a random sample of 60-year-old
men, yet he needs to compare the men he tests with a community
average.


Except where the primary use of a test is to compare individuals
with their own local group, norms should be published at the time of
release of the test for operational use. Norms should refer to defined and
clearly described populations. These populations should be the groups
to whom users of the test will ordinarily wish to compare the persons
tested. If appreciable differences between groups exist (e.g., groups
differing in age, sex, amount of training, etc.), and if a person would
ordinarily be compared with a subgroup rather than with a random
sample of persons, then separate norm tables should be provided in the
manual for each group. [Technical Recommendations, 1954.]


These are official recommendations (see p. 101) regarding test norms, but
these principles have been violated at times in the past. Tests have been
published with no norms. Others have offered norms based on inadequate
samples, and often the samples are improperly described. The difficulties in
this aspect of test interpretation are pointed out forcibly in these remarks by
a test publisher (H. G. Seashore and J. H. Ricks, Jr., 1950):


Legitimate and illegitimate general norms abound in current test
manuals. People-in-general norms are legitimate only if they are based
upon careful field studies with appropriate controls of regional,
socioeconomic, educational, and other factors, and even then only if the
sampling is carefully described so that the test user may be fully aware
of its inevitable limitations and deficiencies. The millions entering the
armed forces during World War II provided the basis of some
good norms on young adult men, though mainly on tests not available to
the public. The standardization of the Wechsler Intelligence Scale for
Children is a recent attempt to secure a representative smaller sample
of children aged 5-15 for setting up tables of intelligence quotients
which may be considered generalized norms for children. The earlier
work of Terman's group to set up good national norms on a small,
well-chosen sample is well known. In the standardizing of some educational
achievement tests, nationwide samplings of children of each appropriate
grade or age and from different types of schools in all parts of the
country are sought in an effort to produce norms that are truly general
for a given span of grades or ages.

Unfortunately, many alleged general norms reported in test manuals
are not backed even by an honest effort to secure representative
samples of people-in-general. Even tens or hundreds of thousands of cases
can fall woefully short of defining people-in-general. Inspection of test
manuals will show (or would show if information about the norms were
given completely) that many such massed norms are merely collections
of all the scores that opportunity has permitted the author or publisher
to gather easily. Lumping together all the samples secured more by
chance than by plan makes for impressively large numbers; but while
seeming to simplify interpretation, the norms may dim or actually
distort the counseling, employment, or diagnostic significance of a score.

With or without a plan, everyone of course obtains data where and
how he can. Since the standardization of a test is always dependent on
the cooperation of educators, psychologists and personnel men, the
foregoing comments are not a plea for the rejection of available samples but
for their correct labeling. If a manual shows general norms for a
vocabulary test based on a sample two-thirds of which consists of women
office workers, one can properly raise his test-wise eyebrows. There is
no reason to accept such norms as a good generalization of adult (or
even of employed-adult) vocabulary. It is better to set up norms on the
occupationally homogeneous two-thirds of the group and frankly call
them norms for female office workers. Adding a few more miscellaneous
cases does not make the sample a general one.

As a rule, then, in reading a test manual we should reject as
treacherous any alleged national or general norms which are not
supported by a clear, complete report on the sample of people they
represent, or norms which are obviously a lumping together of
samples weighted by their size according to chance rather than logic.


If the manual describes the norm sample adequately, the user can judge
the norms by these questions:

• Does the standard group consist of the sort of person with whom my
subjects should be compared?

• Is the sample representative of this group?

• Does the sample include enough cases?

• Is the sample appropriately subdivided?

Norms for any group must be a fair description of that group. A fair sample 
is assured when the test maker takes an exactly random sample of the popu- 
lation (e.g., of all American college freshmen ). Since this is difficult, the test 
maker usually tries to obtain a mixture of cases from all segments of the 
population; for college students, he would draw on large and small colleges: 
private and public, from all sections of the country. If any segment is too 
heavily represented, the norms will be biased. 

In a small sample, accidental inclusion of a few additional good or poor 
cases will make the norms unrepresentative. In large samples, such varia- 
tions should cancel out. No fixed number of cases is required for dependable 
norms. It is better to keep the sample strictly representative, and small, than 
to accumulate large numbers of cases which may not be representative. The 
most unsatisfactory norms are those based on whatever cases happen to be 
conveniently available. A manual may report, "The norms are based on
scores of 2700 sophomores taking general psychology at four Western
colleges." Norms such as these are useless unless the tester wishes to know how
his cases compare with sophomore psychology students at western colleges.

Even when the norm sample is large and representative, it should
ordinarily be subdivided if important, clearly identifiable subgroups earn
different average scores. On the DAT Mechanical Reasoning Test, the boys'
median is 32 while that for girls is only 19. As Wesman (1949, p. 227) points
out:

Counseling would be very different if one had only the single set of
scores [in the norm table]. For example, a boy with a Mechanical
Reasoning score of 40 (in Grade 10) would be close to the 75th
percentile on a combined distribution scale. With only that information,
the counselor would be compelled to consider him as having enough
ability to compete favorably in a curriculum or occupation requiring
mechanical understanding. If he entered such a curriculum, however,
his competition would be almost entirely male. Compared with boys
only, his score of 40 leaves him at the 50th percentile, a ranking not at
all superior.


A new method of developing test norms may become prominent shortly.
This method involves calibrating a new test (or a test needing new norms)
against another well-standardized test. This is similar to the procedure the
maker of an aneroid barometer uses when he places marks on its dial so that
readings agree with an accurate mercury barometer. The method has not
been useful in psychology because no instrument has had norms good
enough to be taken as a standard for other tests. It has been proposed,
however, to administer an experimental set of tests of various abilities and
personality characteristics to a strictly representative sample of 500,000
high-school students. The proposed sample would constitute perhaps 5 percent of
the age group (ignoring the rather small fraction not in school).

The developer of a new test for high-school ages could use these data as a
basis for standardizing. To do so, he would select whichever of the
experimental tests measures about the same thing as his test does. For example, a
Form DD of the TMC could be calibrated against whatever test of
mechanical comprehension is in the experimental battery. Call this test X.
The test developer would apply both test X and Form DD to the same
sample. This sample should be reasonably representative of high-school students
(for example, it should not be restricted to boys in a technical high school),
but it can be fairly small and need not be exactly representative. The
equipercentile method is then used to establish what scores on test X and
Form DD represent the same level of ability. Scores falling at the same
percentile in the calibration sample are taken as equivalent. Suppose we have
the following information:


                                                              Raw Score on Form
                                                              DD Having Same
Raw Score on    Percentile Rank in    Percentile Rank in      Percentile Rank in
Test X          National Sample       Calibration Sample      Calibration Sample
    80                 98                    99                      60
    60                 82                    88                      52
    40                 63                    70                      43

One would conclude, for example, that a score of 60 on Form DD is
equivalent to a score of 80 on test X. We therefore would expect it to fall at the 98th
percentile if it were standardized on a national sample of ninth-graders.
Similarly, one can use the data for test X to establish norms for other grades,
students in technical curricula, girls in vocational courses, etc. Once
Form DD scores are matched to scores on test X, any norms, expectancy
tables, or other research on test X can be used to interpret Form DD. Certain
statistical corrections may be necessary, however, unless Form DD and
test X are highly correlated (Lindquist, 1951, pp. 750-760).
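
The chain of reasoning from a Form DD score to a national percentile can be sketched in Python using the three tabled points above. Straight-line interpolation between the tabled points is an assumption made here for illustration, as are the function names; the text does not prescribe how intermediate scores are to be handled.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear interpolation; xs must be in increasing order."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("score lies outside the tabled range")
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            y0, y1 = ys[i], ys[i + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Equivalent scores taken from the table: both members of each pair fall at
# the same percentile rank (70, 88, 99) in the calibration sample.
FORM_DD_SCORES = [43, 52, 60]
TEST_X_SCORES = [40, 60, 80]
# Percentile ranks of those test X scores in the national sample.
NATIONAL_PERCENTILES = [63, 82, 98]


def form_dd_to_test_x(dd_score):
    """Test X score regarded as equivalent to a Form DD score."""
    return interpolate(dd_score, FORM_DD_SCORES, TEST_X_SCORES)


def form_dd_national_percentile(dd_score):
    """Expected national percentile for a Form DD score, by way of test X."""
    return interpolate(form_dd_to_test_x(dd_score),
                       TEST_X_SCORES, NATIONAL_PERCENTILES)


if __name__ == "__main__":
    for dd in (43, 52, 56, 60):
        print(dd, form_dd_to_test_x(dd), form_dd_national_percentile(dd))
```

A Form DD score of 60 maps to 80 on test X and hence to the 98th national percentile, as stated in the text. Values between the tabled points depend entirely on the interpolation assumption and would in practice come from the full calibration distributions.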

Test norms become obsolete and need to be checked periodically.
Research, for example, suggests that the test scores of adults are, on the
average, higher than those for similar age groups a decade ago. These
changes may be attributed to an increasing level of education.

It is essential that norms be verified whenever a test is altered. Changes of
format can make the test easier or harder and can even alter its
meaning. The Crawford Structural Visualization test is made by cutting a
circular disk into nine pieces of irregular shape. The score is the time the
subject takes in fitting the pieces together. Originally this test was made of
heavy aluminum. After it had been in use for some years, the publisher
began to use wood instead of metal. Psychologists who applied both old and
new versions found that the mean time for the wooden test was 182 seconds,
whereas the original test had a mean of 140 seconds (J. W. Wilson and E. P.
Carpenter, 1948). The publisher had altered the test so that the published
norms were now meaningless; it was a serious error not to revise the manual.


29. It might be appropriate to compare a high-school girl's performance on the
Mechanical Reasoning test with girls' norms, boys' norms, or the combined
norms, depending on the decision to be made. Illustrate.

30. How do you explain the geographical differences on the SSCQT? Is it sound
national policy to encourage more students from one region to attend college
than from another?

31. Assuming a normal distribution, what standard score corresponds to a raw
score of 40 in each of the schools in Table 6?

32. For Form CC, the difficult version of TMC intended for use in engineering
schools, separate norms are reported for 148 engineering freshmen at
Princeton, and for four groups at Iowa State College: 325 engineering freshmen, a
group of agricultural engineering freshmen, 121 sophomores in architectural
engineering, and all engineering seniors (108). It is reported that the subgroups of
senior engineers were so similar that separate norms were not required. How
adequate are these norms? What other tables, if any, would be desirable?

33. In a particular college which admits all high-school graduates who apply, the
median score of the freshman class is at the 65th percentile of the published
norms for freshmen for the Henmon-Nelson test. What factors might account
for this deviation?

34. A psychologist standardizes a primary intelligence test by testing every child
entering the first grade in San Francisco during a particular year.
a. For what purposes would these norms be valuable?
b. Could equally satisfactory norms be obtained without testing every
first-grader?
c. In what way would these norms be biased, as a sample of all 6-year-old
children in San Francisco?

35. A "music aptitude" test measures such factors as tone discrimination. There is
evidence that scores are increased by musical training. If the test is to be used
for advising college freshmen whether to study music, what sort of cases should
be used to establish national norms?

36. How would you proceed to get an extremely representative sample of adult
men in Chicago to use as a standardizing group for a mental test? Assume that
you have sufficient research funds to pay each man $2.00 for taking the test.

37. Would local norms or national norms be most useful in interpreting each of
the following?
a. A personality test given to indicate whether a prisoner is psychotic.
b. An intelligence test given to an infant considered for adoption.
c. A reading test given to determine if a high-school boy needs individual
remedial instruction.
Suggested Readings


Bauernfeind, Robert H. Are sex norms necessary? J. counsel. Psychol., 1956, 3,
57-63. Wesman, Alexander G. Separation of sex groups in test reporting.
J. educ. Psychol., 1949, 40, 223-229.
These authors argue as to whether one should present separate norms for
each sex in test manuals. The two articles should be compared to note the
points where the writers agree, and to determine why they come to different
conclusions.

Froehlich, Clifford P., & Darley, John G. Statistical methods of summarizing the
results of a single test or measuring device. Studying students. Chicago: Science
Research Associates, 1952. Pp. 12-38.
In extremely simple language the authors explain the computation of standard
deviations, percentiles, and other statistics, and also discuss desirable qualities
of test norms.

Lorge, Irving, & Thorndike, Robert L. Procedures for establishing norms. Technical
manual, Lorge-Thorndike Intelligence Tests. Boston: Houghton Mifflin, 1957.
Pp. 4-6.
This compressed summary describes the extensive research conducted to
establish norms for one of the prominent modern mental tests. The procedures
used to select communities for testing are unusually well designed. Results
show how norms depend on the socioeconomic level of the community tested.

Seashore, Harold G. Methods of expressing test scores. Test Serv. Bull., 1955,
No. 48. (Available on request from the Psychological Corporation.)
A dozen scales for reporting test scores are compared, among them stanines,
percentiles, and College Board scores.

Traxler, Arthur E. Administering and scoring the objective test. In E. F. Lindquist
(ed.), Educational measurement. Washington: American Council on Education,
1951. Pp. 329-416.
Besides describing, with excellent illustrations, all the major techniques used
for efficient scoring of tests both by hand and by machine, the chapter
discusses ways of ensuring that standard instructions are followed.

Traxler, Arthur E. The IBM test scoring machine: an evaluation. Proceedings,
1953 Invitational Conference on Testing Problems. Princeton: Educational Testing
Service, 1954. Pp. 139-146.
Traxler discusses the history and contribution of these machines, raising
incidental questions as to the possible harm done by forcing all large-scale testing
into the mold of five-choice multiple-choice items which fit the machine
efficiently. Gives a realistic picture of the practical limitations of automatic test
scoring. Other papers in the same symposium describe superdevices some of
which are still in the drawing board stage.

Validity 


NEED FOR CRITICAL EVALUATION OF TESTS 


WHEN a teacher investigates the mental ability of his pupils, he looks for
the best mental test available. An industrial psychologist selecting workers
for a factory wishes to try the best possible test of mental ability. The clinical
psychologist studying a child who may be feeble-minded needs the mental
test which will give the most accurate results. Each of these users therefore
asks, "What is the best test of mental ability?" But the test which best serves
one of these testers is probably not the best for either of the others.

The purchaser of tests has a confusing problem. He is faced with long tests
and short tests, famous tests and unfamiliar tests, old tests and new tests,
ordinary tests and novel tests. The catalog of a leading test distributor offers
tests of general mental ability and 19 tests of personality. Each of these tests
was produced by a psychologist who thinks his test is in some way superior
to the others on the market. He is frequently correct.

Different tests have different virtues; no one test in any field is "the best"
for all purposes. No test maker can put into his test all desirable qualities.
A change in design improves the test in one respect only by sacrificing
something else. Some tests work well with children but not with adults; some
give precise measures but require a long time; some give satisfactory general
measures but are inferior for detailed diagnosis.

Tests must always be selected for the particular purpose for which they are
to be used. Even in similar situations the same tests may not be appropriate.
Readiness of a child for first grade must be measured by different tests,
depending on the instructional plan. Tests which select supervisors well in one
plant prove valueless in another. And clinicians may have to choose different
tests for each patient. No list of "recommended tests" can eliminate the
necessity for carefully choosing tests to suit each situation.

The user of tests has constantly to evaluate new developments. New tests
are produced, new uses of tests are discovered, and new findings about old
tests are brought to light. Any list of superior tests, therefore, soon becomes
outdated. Of the nine psychological tests most used in clinics in 1935 only
two remained in the top nine in 1946. Only two of the newcomers to the list
were published later than 1935; the other five were available in 1935 but
their usefulness was overlooked at that time (Louttit and Browne, 1946).
Buros (1941, p. 11) made the following comment in selecting tests for
discussion in his yearbook:


The decision was made to include old as well as new tests. Reviews of
old tests may prove effective in eliminating from use many tests which
were among the best in their day but are now outmoded and inferior to
recently constructed tests. On the other hand, such reviews may result
in increasing the use of old tests and testing techniques which compare
very favorably with tests being currently published. The sale of outmoded
and ever-decreasingly valid tests persists far beyond the sale of
textbooks published in the same years.

Prominence and popularity are not necessarily signs of quality. In clinical
psychology and counseling particularly, fads in testing flourish. As Schafer
says (1954, p. 6), “Because of its rapid growth, a boom town excitement has
characterized clinical psychology until very recently. News of a ‘good’ test,
like news of striking oil, has brought a rush of diagnostic drillers from the
old wells to the new and has quickly led to the formation of a new elite” of
persons specializing in that test. Techniques rushed into application far in
advance of adequate research include projective tests such as the Rorschach,
formulas for detecting brain damage from intelligence tests, and
questionnaires such as the Taylor anxiety scale. The last-named, indeed, is
a set of questions which the author developed for use only in laboratory
research on learning, but clinical investigators seized the scale for diagnostic
use without any evidence that the scale was superior for that purpose. Many
of the fads in testing wane quickly, but some invalid tests hold their
“best-seller” status for a generation.


The testing “industry” of today had informal, even casual, beginnings. A
psychologist or physician wanted to observe some type of motor, intellectual,
or emotional behavior and chose a stimulus or task which he thought
gave a good opportunity for observation. As he mentioned his findings to
others, they copied his technique in their own clinics and laboratories. Soon
there was a small market for equipment (tachistoscopes for studying flash
perception, blocks for tests like Kohs’, etc.). A few books were written
between 1910 and 1915, each describing one investigator's procedures, but
there was no large-scale manufacture of tests. Test publication in the modern
sense resulted from the great interest in clinical and educational testing
after World War I, and particularly from the popularity of standardized
group tests. The view then prevalent was that a test score could be
interpreted adequately only by comparison to national norms based on thousands
of cases. Users all over the country wished to purchase the same tests,
packaged in a form which ensured uniform procedure, and accompanied by
national norms.

One of the forces creating the American testing industry of today is the
decentralization of schools, clinics, and guidance services. Every school
system is free to adopt tests or not, and to choose whichever ones it prefers.
Each counseling agency can purchase a different set of tests, and sometimes
each psychologist within the agency chooses tests for himself. This
decentralization, combined with a demand for carefully developed instruments,
provides a competitive market which encourages publication of tests in great
number and great variety. With these tests available, an industrial
psychologist rarely thinks it necessary to make up new tests for his own plant.
Even a great national agency such as the Veterans Administration relies on
published tests for its clinical program.

In Europe, competitive publication is almost unknown. There is, on the
one hand, a tendency for each clinical psychologist or each industrial
psychologist to develop his own testing procedures, that is, to modify the
methods his colleagues have used. School systems and guidance services, on
the other hand, are generally under centralized national control. Each
service therefore develops its own series of standardized tests, and the
counselor or local school administrator has no choice. A certain amount of
test publication is now beginning in Europe, the stock consisting largely of
translated American and British tests. There are also books
describing diagnostic procedures. Procedures are not fully standardized,
and little test research is published; this means that the person responsible
for a testing program in Europe must take on even greater responsibility
than his American counterpart.

American test publication began in a small way. A psychologist who had
prepared a test printed copies for general sale, perhaps through a firm selling
apparatus to psychology laboratories. As the demand for tests grew,
particularly after World War I, some textbook publishers began to handle tests
and some firms specializing in school tests were established. Until about
1945, the typical test was developed by an author or team of authors who
completed the test and then offered it to the publisher. The publisher gave
some assistance in the final stages of research and in editing the test manual,
but the main scientific responsibility was the author's.

In recent years, this situation has changed. Experience made clear that
satisfactory tests require long periods of development, following the most
technical research design. Publishers and consumers began to examine
critically the quality of test material and the technical information regarding
the effectiveness of the test. Today, authors are often discouraged from
releasing new tests on which research is inadequate, even though tests of
similar quality would have been accepted by most publishers twenty years ago.
Test construction has increasingly been taken over by the test publishers,
who have added technical staffs for this purpose.

In addition to tests on the open market, Americans are making increasing
use of so-called “program” tests. These are tests developed to fit the needs of
a particular large program. Examples are an aptitude test developed for
medical-college admission, another for awarding college scholarships on a
competitive basis, and a battery of aptitude and achievement measures
used in schools throughout a state for guidance purposes. These programs
often require tests whose questions can be kept secret from persons
to be tested. Tests for the programs are sometimes developed by professional
persons employed by the testing agency. More often, the tests are developed
by the staff of a test publisher. Many of the tests are constructed to resemble
published tests which have already been standardized and validated. Program
tests should be developed as carefully as tests published to be used
by the whole profession. Technical information about the quality of the tests
is not readily available to the general student of testing, which sometimes
interferes with the evaluation of research using program tests.

Though there has been an increasing concentration of responsibility in the
hands of persons well trained in test construction, even the newer tests have
marked limitations. Some of these limitations result only from the fact that
no one test can do everything, but some tests still are published without
adequate research and refinement. Some, even popular ones, do not succeed
in measuring what they were intended to measure, and some measure
characteristics other than what their titles suggest. The author's
description of a test understandably advertises its favorable features. Even
today, some test manuals seriously mislead the uncritical reader. One
aptitude battery for vocational guidance was published with what appeared at
first glance to be impressive evidence of validity; but the
“evidence” consisted of validity coefficients for an entirely different set of
tests used in military selection. The only connection between the two batteries
was a vague resemblance in plan. Since some published tests are nearly
worthless, and since others extremely useful for one purpose will not perform
well in another situation, the user must be able to choose among tests
intelligently.


Ability to judge tests is important for many people who will never choose
tests themselves. The businessman may turn his selection and promotion
problems over to psychological consultants. The psychiatrist or juvenile-court
judge delegates responsibility for choice and interpretation of
tests to psychologists. Nonetheless such consumers of test results
must know how tests are evaluated and be aware of the common weaknesses
of tests. Some industrial consultants recommend testing programs which
other psychologists regard as overelaborate or inadequately validated. The
executive needs to know something about testing if he is to ask the right
questions regarding their proposals. There is an understandable tendency
for the clinical tester to become overenthusiastic about the tests in
which he is expert, and to make his recommendations too confidently.
On the other hand, executives, psychiatrists, judges, and others who
receive reports from psychologists frequently depart from the recommendations
made. Such departures, insofar as they take into account facts not
available to the tester, are necessary and justified. But giving great weight to
supplementary impressions and little weight to objectively observed
behavior spoils more decisions than it helps. If the user of test information
knows how tests are validated, he can decide when his own impressions are
substantial enough to be given comparable weight.


1. “Improving a test in one way weakens it in another.” What advantage, and
what disadvantage, comes from each of the following changes?
a. Lengthening a test.
b. Making it interesting to children.
c. Making it more diagnostic of strong and weak points.
d. Giving it as an individual test instead of as a group test.
2. This is a letter received by a psychologist from an industrial personnel manager
hiring office and factory workers. How would you answer it on the basis of the
paragraphs above, knowing that the tests mentioned are representative of their
type?

“. . . Just now we are planning the use of the following tests: Otis intelligence
and Minnesota Multiphasic Personality Inventory, and aptitude tests for
our openings, such as the Bennett test. Does this seem to be a well-balanced
testing schedule for industry? Are there tests that you think preferable to
these?”

3. It has been suggested that the American Psychological Association set up a
committee to award a Seal of Approval to all well-prepared tests. What are the
advantages and disadvantages of such a system? Would this plan eliminate the
need for critical judgment by users?


The Test Manual

The manual (sometimes supplemented by a technical handbook) is the
principal source of information about the technical quality of a published
test. The manual is sold with the tests and provides detailed directions,
scoring procedures, and research findings.

Manuals are not always as useful as they should be. Some manuals omit
facts which users need to judge the test, or gloss over unfavorable ones.
Even a generally excellent manual may have some inadequate sections.
Faults are particularly frequent in tests issued before 1945, because authors
of older tests rarely prepared complete manuals or brought their manuals
up-to-date.

Preparing a good manual is difficult. The more research there is on the
test, the harder it is to summarize properly into a manual. The manual must
be clear enough so that any qualified user can comprehend it, and so that
the reader who is not qualified will realize that he is not. Yet the material
must be precise enough to satisfy specialists in test research.
The Technical Recommendations. A major aid in the preparation and use of
test manuals is the Technical Recommendations published in 1954. Committees
of national organizations interested in measurement studied the
problem of improving information about tests and prepared a lengthy set of
Technical Recommendations for Psychological Tests and Diagnostic Techniques
(1954). A supplementary statement dealing with problems of
achievement testing was also prepared (Technical Recommendations,
1955).

The Technical Recommendations indicate what the manual should contain.
Many of the recommendations are accompanied by examples illustrating
good or poor procedure. Figure 17 gives an extract from the Technical
Recommendations to illustrate their form and content. This chapter and
the next, on judging the quality of tests, discuss the aspects of tests with
which the Technical Recommendations are concerned.

The recommendations are used in several ways: Authors use them as a
guide in writing manuals, and publishers use them in deciding when a test
is ready for release. The recommendations draw the attention of test
purchasers to points to consider in evaluating a test.

Improvement in test construction and manuals was accelerated by the work
of Professor O. K. Buros, who began to release critical reviews of tests in the
1930's. These critical listings now take the form of Mental Measurements
Yearbooks, the most recent of which appeared in 1949, 1953, and 1959.
Nearly all tests currently on the market, and some program tests, are
reviewed in the Buros series. Each test is examined by two or more specialists
chosen because of their practical experience and technical knowledge.
Reviewers discuss what each test may best be used for, and draw attention to
faults and questionable claims made in the test manual. Test reviews may also
be found in several journals, particularly Educational and Psychological
Measurement and the Journal of Consulting Psychology. Although these reviews
are an aid to the purchaser of tests, he must still judge tests for himself.
He will find that reviewers sometimes disagree in judging a test, particularly
when they approach it from different points of view. Sometimes a reviewer
gives much attention to rather petty faults, and the reader must weigh these
criticisms against the merits of the test. On the other hand, reviewers fail
occasionally to notice faults. Even with a well-balanced review of a particular
test, the final decision to use it or not to use it depends on the specific
situation, which only the prospective user knows.
We have already discussed some of the qualities which make a test suitable
or unsuitable. In Chapter 1 attention was drawn to the necessity of
selecting tests which the user is competent to give and interpret. Chapters 3
and 4 introduced other considerations, including clarity of directions,
freedom from coachability, convenience of scoring, objectivity, and adequate
norms. All these are important because they affect directly or indirectly the
power of the test to improve decisions.

F. Scales and Norms

F 1. Scales used for reporting scores should be such as to increase
the likelihood of accurate interpretation and emphasis by test interpreter
and subject. ESSENTIAL

[Comment: Scales in which test scores are reported are extremely varied. Raw
scores are used. Relative scores are used. Scales purporting to represent equal
intervals with respect to some external dimension (such as age) are used. And
so on. It is unwise to discourage the development of new scaling methods by
insisting on one form of reporting. On the other hand, many different systems
are now used which have no logical advantage, one over the other.
Recommendations below that the number of systems now used be reduced to a few
with which testers can become familiar, are not intended to discourage the use
of unique scales for special problems. Suggestions as to preferable scales for
general reporting are not intended to restrict use of other scales in research
studies.]

F 2. Where there is no compelling advantage to be obtained by reporting
scores in some other form, the manual should suggest reporting
scores in terms of percentile equivalents or standard scores. VERY DESIRABLE

[Comment: Professional opinion is divided on the question whether mental
test scores should be reported in terms of some theoretical growth scale, such
as the intelligence quotient or the Heinis index. Thus, a test developer who
has rationale for such scales as these should use them if he regards them as
especially adequate.

On the other hand, there is no theoretical justification for scoring mental
tests in terms of an “IQ” which is not derived in terms of the theory underlying
the Binet IQ and which has different statistical properties than the IQ does.
Standard or percentile scores would be preferable to arbitrarily defined IQ
scales such as are used in the Otis Gamma and Wechsler-Bellevue tests.

Strong recommends that Vocational Interest Blank scores be converted to
letter grades where “A” indicates that at least two-thirds of the criterion
group equaled or exceeded a given score, etc. He bases this recommendation on
the ground that finer score discrimination would lead only to unwarranted
attempts at finer interpretative discrimination.]

F 2.1 If grade norms are provided, tables for converting scores to
percentiles (or standard scores) within each grade should also be provided.
ESSENTIAL

[Comment: At the high school level, norms within courses (e.g., second year
Spanish) may be more appropriate than norms within grades.]

F 3. Standard scores obtained by transforming scores so that they have
a normal distribution and a fixed mean and standard deviation should
in general be used in preference to other derived scores. For some tests
there may be a substantial reason to choose some other type of derived
score. VERY DESIRABLE

FIG. 17. A section from the Technical Recommendations (1954).
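
Recommendations F 2 and F 3 in Figure 17 can be made concrete with a short
sketch. The Python fragment below converts a raw score to a percentile rank
within a norm group and to a linear standard score. The norm-group scores, the
function names, and the choice of a mean of 50 and a standard deviation of 10
are illustrative assumptions, not part of the Technical Recommendations. F 3
speaks of scores transformed to a normal distribution; the linear conversion
shown here is the simpler variant and leaves the shape of the distribution
unchanged.

```python
from statistics import mean, pstdev

def percentile_rank(raw, norm_scores):
    """Percent of the norm group scoring below `raw`, counting ties as half."""
    below = sum(1 for s in norm_scores if s < raw)
    ties = sum(1 for s in norm_scores if s == raw)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)

def standard_score(raw, norm_scores, new_mean=50.0, new_sd=10.0):
    """Linear standard score; new_mean and new_sd are assumed conventions."""
    m = mean(norm_scores)
    sd = pstdev(norm_scores)  # population SD of the norm group
    return new_mean + new_sd * (raw - m) / sd

# Hypothetical raw scores for a small norm group (invented for illustration).
norm_group = [12, 15, 15, 18, 20, 22, 25, 25, 28, 30]

print(percentile_rank(20, norm_group))
print(round(standard_score(30, norm_group), 1))
```

A raw score has no meaning until it is referred to a norm group in some such
way; the same raw score would receive a different percentile rank in a
different norm group.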

The quality which most affects the value of the test, however, is its valid- 
ity. Validity is high if a test measures the right thing, i.e., if it gives the in- 
formation the decision maker needs. No matter how satisfactory it is in other 
respects, a test which measures the wrong thing is worthless. We shall devote 
the remainder of this chapter to validity. Other relatively less important fac- 
tors in choosing a test will be treated in the next chapter, after which we 
shall consider as a whole the problem of choosing a test. 


TYPES OF VALIDITY 


A test which helps in making one decision may have no value at all for
another. This means that we cannot ask the general question “Is this a valid
test?” The question to ask is “How valid is this test for the decision I wish to
make?” or more generally, “For what decisions is this test valid?”

Very often, especially in selection or classification, the decision is based on
a person's expected future performance as predicted from the test score. If
these expectations are confirmed, the test has given highly useful information,
but if the predictions do not correspond to what happens later, the test
was worthless. To know how validly the test predicts, a follow-up study is
required.

In selection or classification, the psychologist wants to maximize some
outcome: job success, amount learned, obedience to law, etc. He gives a test,
makes his predictions, tries the treatment suggested by these predictions,
and waits to see what happens. He obtains a record of the outcome (foreman's
rating, school grade, or number of court appearances, for example).
This record, which we speak of as a criterion, he compares to the prediction.
This is a straightforward empirical¹ check on the value of the test. The
psychologist has determined what we call its predictive validity.

In many situations for which tests are developed, some more cumbersome
method of collecting information is already in use. If the existing method is
considered useful for decision making, the first question in validation is
whether the new test agrees with the present source of information. If they
disagree, the test may have value of its own, but it is certainly not a
substitute for the original method. Validation again requires an empirical
comparison. Both the test and the original procedure are applied to the same
subjects, and the results are compared. For example, tests intended for clinical
diagnosis are compared with the judgments made by a psychiatrist who
interviews each patient. A test of proficiency in radar maintenance may be
compared with ratings given by an instructor who watches each man in the
shop. This type of empirical check on agreement is called concurrent
validation, because the two sources of information are obtained at very nearly
the same time (Figure 18).

¹ An empirical method involves collection and analysis of data. It is contrasted
with purely logical methods of arriving at conclusions.
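
The concurrent comparison just described can be sketched in a few lines of
Python. The fragment tallies how often a classification made from a test agrees
with the judgment of a psychiatrist who interviewed the same patients. The
patient data and the function name are hypothetical, and simple percentage
agreement is only one rough index; it makes no allowance for agreement that
would occur by chance.

```python
def agreement_rate(test_calls, judge_calls):
    """Proportion of subjects on whom the test and the judge agree."""
    if len(test_calls) != len(judge_calls):
        raise ValueError("both procedures must be applied to the same subjects")
    matches = sum(t == j for t, j in zip(test_calls, judge_calls))
    return matches / len(test_calls)

# Hypothetical classifications of six patients, obtained at about the same time.
test_calls = ["neurotic", "normal", "normal", "neurotic", "normal", "neurotic"]
judge_calls = ["neurotic", "normal", "neurotic", "neurotic", "normal", "normal"]

print(round(agreement_rate(test_calls, judge_calls), 2))
```

If agreement is low, the test may still have value of its own, but it cannot
be offered as a substitute for the interview.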

FIG. 18. Predictive and concurrent validation compared. [Diagram: in concurrent
validation the test is correlated with a criterion obtained at about the same
date; in predictive validation the test is correlated with a criterion obtained
at a later date.]

When tests are used to evaluate educational or therapeutic programs, a
different kind of validation may be needed. The program is trying to produce
a certain change in behavior, and therefore, to evaluate the effectiveness of
the program, the tester needs to measure just that type of behavior. If a
course is supposed to teach American geography, it would not be fair to
measure its effectiveness by a test on the geography of New England.
The tester interested in evaluation needs to ask, “Does this test represent the
content or activities I am trying to measure?” Instead of comparing scores
on the test with some other measure or judgment, as in empirical validation,
he must examine the items themselves and compare them with the content
he wishes to include. This process is called content validation. Thus the
content validity of the geography test would have to be studied by checking
the items against the course of study the students have followed.
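
Content validation is a matter of judgment rather than computation, but a
simple tally can organize the comparison of items with the course of study. The
sketch below is only an aid of that kind. The regions, the intended emphasis
given to each, and the classification of the ten items are all invented for
illustration; the function name is likewise hypothetical.

```python
from collections import Counter

def coverage_report(item_topics, course_weights):
    """Pair the intended share of each topic with its actual share of items."""
    counts = Counter(item_topics)
    total = len(item_topics)
    report = {}
    for topic, intended in course_weights.items():
        report[topic] = (intended, counts.get(topic, 0) / total)
    # Items on topics the course never covered are listed separately.
    off_syllabus = sorted(set(counts) - set(course_weights))
    return report, off_syllabus

# Hypothetical course of study in American geography (shares of emphasis).
course_weights = {"New England": 0.2, "South": 0.2, "Midwest": 0.3, "West": 0.3}
# Hypothetical judgments of which region each of ten test items deals with.
item_topics = ["New England"] * 8 + ["South"] + ["Midwest"]

report, off_syllabus = coverage_report(item_topics, course_weights)
for topic, (intended, actual) in report.items():
    print(f"{topic:12s} intended {intended:.0%}  actual {actual:.0%}")
```

A test with the item distribution assumed here would give most of its weight
to New England and none to the West, and so would not fairly represent the
course.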
The aforementioned types of validity are examined when a test is
intended for a specific practical use. Sometimes, however, the test is used to
arrive at a description of the individual which will be used for many
purposes, or the test may measure outcomes for scientific rather than
immediately practical purposes. In these applications, the test results are
likely to be translated into general psychological terms. Instead of reporting
that an experimental treatment “has increased the score on the Jones test,” the
psychologist wants to make the broader interpretation that “anxiety has
increased.” The concept anxiety is part of a psychological theory which
indicates what behavior to expect from a person with great anxiety, under
various conditions. Whenever a tester asks what a score means psychologically or
what causes a person to get a certain test score, he is asking what concepts
may properly be used to interpret the test performance. This type of
theoretical concept is called a construct, and the process of validating such an
interpretation is called construct validation. In order to show that a given
construct applies to a test, it is necessary to derive hypotheses about test
behavior from the theory related to the construct and to verify them
experimentally. The theory of “anxiety” accepted by the tester might include
such expectations as the following: if persons are exposed to a threat of
electric shock, their anxiety will increase; neurotics are more anxious than
nonneurotics; anxiety is lowered by administration of a certain drug; anxious
persons have a high level of aspiration. Each of these expectations can be
tested by an experiment or a statistical study of group differences. Determining
construct validity is much more complex than the other types of validation,
as our later discussion will show.

Table 8 summarizes the statements made to this point.

With so many different ways to examine validity, each one applying to a
particular use of the test, it is apparent that no test developer can validate
his test exhaustively. The test user cannot expect the manual to provide
complete evidence on validity, yet he does not wish to use a test whose validity
is uncertain. What can we legitimately demand of the test developer? The
Technical Recommendations indicate that he must assume the burden of
proof whenever he recommends the test for a certain use. “The manual
should report the validity of each type of inference for which the test is
recommended.” Most tests have a few principal uses for which their validity
has been thoroughly studied, and this research answers the questions of
most test users. The user who wishes to apply the test in any other way may
have to make his own validity studies.

No matter how complete the test author's research, the person who is
developing a selection or classification program must, in the end, confirm for
himself the validity of the tests in his particular situation. And the person
who is evaluating a training program must determine the content validity of
the tests for his program. In this chapter we shall concentrate on
understanding the information on validity found in test manuals. Later
(Chapter 12), it will be necessary to examine how the tester can conduct
validation studies in his own situation.

The Bennett TMC manual (Form AA) is fairly typical, though less extensive
than some. It summarizes studies of predictive and concurrent validity made in
industry and military training. This information indicates that the test has
considerable predictive value for mechanical trades and engineering. In the
manual there is no information on the test as a predictor of school and college
grades, a serious omission. Correlations of the TMC with several intelligence
tests and with other aptitude tests are reported. This feature informs the
tester about the possibility of substituting the TMC for one of the other
tests, and also aids in interpreting the construct “mechanical comprehension.”
Content validation is not required for the TMC.

TABLE 8. Four Types of Validation

Predictive validity
  Question asked: Do test scores predict a certain important future
    performance?
  Procedure: Give test and use it to predict the outcome. Some time later
    obtain a measure of the outcome. Compare the prediction with the outcome.
  Principal use: Tests used in selection and classification decisions.
  Examples: Admission test for medical students is compared with later marks.
    Mental test given infants at time of adoption is compared to test of
    school readiness at age 6.

Concurrent validity
  Question asked: Do test scores permit an estimate of a certain present
    performance?
  Procedure: Give test. Obtain a direct measure of the other performance.
    Compare the two.
  Principal use: Tests intended as a substitute for a less convenient
    procedure.
  Examples: Group mental test is compared to an individual test. Diagnosis of
    brain damage based on Block Design test is compared with neurological
    evidence.

Content validity
  Question asked: Does this test give a fair measure of performance on some
    important set of tasks?
  Procedure: Compare the items logically to the content supposed to be
    measured.
  Principal use: Achievement tests.
  Examples: A test of shorthand ability is examined to see whether the content
    is typical of office correspondence. Tasks in a sewing proficiency test
    are compared with the course of study pupils have followed.

Construct validity
  Question asked: How can scores on this test be explained psychologically?
  Procedure: Set up hypotheses. Test them experimentally by any suitable
    procedure.
  Principal use: Tests used for description or in scientific research.
  Examples: A test of art aptitude is studied to determine how largely scores
    depend on art training, on experience in Western culture, etc.

How well does this information fill the counselor's needs? Counselors need
to advise students regarding many questions of vocational specialization, yet
only scattered validity studies are available. Indeed it would be impossible
to conduct separate validations for all the vocations counselees will wish to
consider. The list would have to cover architecture, aeronautics and
hydraulics, metalworking and woodworking, design, construction, maintenance,
and so on ad infinitum. Even where a predictive study has been made for a
specific occupation, one must recognize that not all jobs within the
occupation make the same demands. The counselor therefore cannot hope to make
definite predictions.

The counselor can interpret the score only by knowing what “mechanical
comprehension” signifies, as here measured. How much does it depend on
specific training? This can be determined by learning how much scores
increase during a shop course or a physics course. Does it apply solely to
mechanical-manipulative occupations, or to all work that requires reasoning
about forces and motion? This is answered by integrating the available
prediction studies. Are individual differences stable enough to justify
long-range predictions? This calls for a long-term follow-up. Does mechanical
comprehension promise skill in handling tools and machines? The answer comes
from the comparison of TMC scores to scores on apparatus tests.

The Bennett manual does not include all this information. Older tests, in
general, were published without comprehensive validation, and even the
best manual must leave some questions unanswered. The modern DAT
manual, after 35 pages of validity data, concludes with a statement urging
the counselor to prepare expectancy tables for courses in his own school and
for jobs in his own community. The test constructor is not expected to
answer every last question about validity before publishing his test, but he is
expected to give the test user a fair impression of its validity.
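
An expectancy table of the kind a counselor might prepare for his own school
can be sketched as follows. The table reports, for each band of test scores,
the proportion of past students who later succeeded in a course. The scores,
the outcomes, the score bands, and the function name are all hypothetical, and
with so few cases the proportions would be far too unstable for actual use.

```python
def expectancy_table(scores, succeeded, bands):
    """Proportion of successes within each inclusive (low, high) score band."""
    table = {}
    for low, high in bands:
        outcomes = [ok for s, ok in zip(scores, succeeded) if low <= s <= high]
        # A band with no cases is reported as None rather than as zero.
        table[(low, high)] = sum(outcomes) / len(outcomes) if outcomes else None
    return table

# Hypothetical test scores and later course outcomes for ten former students.
scores = [22, 25, 31, 34, 38, 41, 45, 47, 52, 55]
succeeded = [False, False, False, True, False, True, True, True, True, True]
bands = [(20, 29), (30, 39), (40, 49), (50, 59)]

table = expectancy_table(scores, succeeded, bands)
for (low, high), rate in table.items():
    print(f"scores {low}-{high}: proportion succeeding = {rate}")
```

Such a table states validity in the terms the counselor needs: what has
happened locally to students with scores like those of the student before him.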


4. What predictive or concurrent validity should be studied in these situations?
a. The U.S. Employment Service wishes to test men to determine which ones
have had enough experience to be referred to contractors who have vacancies.
b. A medical school wishes to test the personalities of applicants to
determine which ones are best suited to a physician's responsibilities.
c. A pencil-paper test is used to identify students entering junior high school
who have emotional difficulties and should be singled out for counseling.
5. A typing test which has excellent content validity for the original user may
have poor content validity for some other user. Illustrate this statement.
6. Why would it be desirable to find out “what a test of pharmacy aptitude
measures,” if we already know that it predicts success in pharmacy school?

PREDICTIVE AND CONCURRENT VALIDITY


The Criterion 


An investigator studies predictive validity when his primary interest is in 
some outcome. The outcome is what we want to improve by our professional 
decisions: it is the employee's production on a job, the patient's response to 
therapy, the counselee's satisfaction with his life after counseling. The
criterion is a record of the outcome.

For example, suppose a wholesale hardware concern wants to hire g0 
salesmen and is trying a test to predict sales ability. The outcome that
interests the firm is the sales each man will make. For research, this outcome
has to be expressed in terms of some definite index of success. Perhaps
"amount sold in six months" will serve as a criterion measure. This result
must be compared to test scores recorded before the men were hired, to
learn how much predictive validity the test has. If the test is unrelated to
the criterion, it is invalid for selecting salesmen for this firm. A single
predictive study does little to clarify what psychological factors are represented
in a test, but it does establish the test's usefulness and limitations for one
practical situation.

The greatest difficulty in empirical studies is to obtain a suitable criterion
measure. If the index does not really represent "selling success," the test has
not been given a fair trial. Let us look at the weaknesses of the criterion
suggested for validating the salesmanship test. In the first place, it represents
only the wholesale hardware business, so that at best we can judge the test
for only this one use; additional predictive studies will be required if the test
is considered for hiring men to sell insurance or machine tools. Although
"amount sold" appears to be a fair basis for judging success, some men are
assigned more desirable territory than others, so that sales do not reflect
ability alone. Suppose we control this by comparing each man's sales with previous
sales in his territory. We still have not considered the possible effect on
business of variable factors, such as poor crops in one region. Still another
problem is that sales alone may not be what we desire from a salesman.
A high-pressure salesman may build up high total sales on a first trip but,
by overselling, create problems which will eventually harm the firm's
business.

A common type of criterion is the rating or grade. Aptitude tests are
validated against marks earned in school. Industrial predictors are validated
against ratings by supervisors. These ratings are rather poor criteria because
the judge often does not know the facts about the person and because judges
disagree. When a test fails to predict a rating, it is hard to say whether this
is the fault of the test or of the rating.
Concurrent validity is investigated when the test is proposed as a
substitute for some other information; this information is then the criterion.
Designers of new tests frequently establish concurrent validity for their
instruments by comparing them to established tests. New tests of intelligence, for
example, have frequently been correlated with the Stanford-Binet
intelligence test, whose predictive and construct validity have been studied
extensively. A test which agrees with the Binet test measures "whatever the
Binet test measures" and may be relied upon for the same purposes. This
procedure is helpful only if the test used as criterion is meaningful and
important. There is little value in knowing that three questionnaires of
"neurotic tendency" agree if none of the tests measures anything save ability to
see through the test and give "desirable" answers. Likewise, a psychologist
who distrusts psychiatric diagnoses would be hesitant to use them as a
criterion for a personality test.



7. Criticize each of the following criteria:
a. Ratings of student teachers by their supervisors, as an index of teaching
ability.
b. Number of accidents a driver has per year, as an index of driver safety.
c. Number of accidents a driver has per thousand miles, as an index of driver
safety.
8. A test of preschool children is validated in three ways: (1) Intelligence is
defined as ability to learn responses with which one has had no previous
experience. The test items are examined and found to fit this definition.
(2) Scores on the test, given at age 3, are found to be related to reading skill
and vocabulary knowledge at the end of the first grade. (3) Scores on the
test, given at age 3, are found to be related to scores on the Stanford-Binet
test given at age 16.
a. What possible uses of the test are warranted, on the basis of each of these
studies?
b. Would it be possible for a test to show high validity by method (2) and to
lack validity according to the other two procedures?
9. A study-habits inventory asks such questions as "Do you daydream when
you should be studying?"
a. What criterion would you use to determine empirically whether the
inventory really measures study habits?
b. What criterion would you use to determine whether the inventory predicts
success in college?
c. Which study would be best to show that the test is valid?
10. Criticize the procedure indicated in the following account of a study of
success of teachers college students (cited by Eckelberry, 1947):

"[The] correlation between all thirty of the [predictor] variables and the
[school] superintendents' ratings was .17, but that between the variables
and marks earned during four years of college was [...] superintendents'
ratings were not, the marks were substituted for the ratings as a criterion
of success."
Correlation Coefficients


A study of predictive or concurrent validity is nearly always reported in
terms of a correlation coefficient. This is a statistical summary of the
relationship between two variables, and it plays a fundamental part in test research.
It is the most common method for reporting the answer to such questions as
the following: Does this test predict performance on the job? Do these two
tests measure the same thing? Do scores people made on this test a year ago
agree with the scores they make now?

To illustrate correlation, let us consider ten hardware salesmen who are
given three tests when hired. After six months, when the criterion records
are in, we have the information in the left portion of Table 9. The problem is
to judge which test is the best predictor. The test scores are hard to
examine in "raw" form, since each test has a different average.

TABLE 9. Data on Ten Hardware Salesmen

                    Test Scores          Criterion   Criterion      Test Rank
Salesman    Test 1   Test 2   Test 3     Measure       Rank       1     2     3
   A          30       45       34       $25,000         6        4     7     7
   B          34       64       35        38,000         2        2     3     5.5
   C          32       32       35        30,000         4        3     9     5.5
   D          47       52       31        40,000         1        1     5     9
   E          20       74       36         7,000        10        9     1     4
   F          24       50       40        10,000         9        7     6     1
   G          27       53       [?]       22,000         7        5     4     3
   H          25       36       30        35,000         3        6     8    10
   I          22       71       32        28,000         5        8     2     8
   J          16       28       39        12,000         8       10    10     2

One way to simplify the data is to change them to ranks, as in the right
portion of Table 9. (Note that when two men tie on test 3, we give each
the rank halfway between the positions which the pair occupies.) We
see that E, poorest on the criterion, has very low rank on test 1, high on 2, and
average on 3. Man F, also poor as a salesman, is below the median on 1 and
2, but at the top in 3. Before reading ahead, study Table 9 to decide how
valid each test is for selecting hardware salesmen.

Rank Correlation. To obtain a single estimate of the goodness of each test,
we compute a correlation. A simple procedure, useful for studies involving
few cases, is the rank-difference correlation. (Below, we shall show the
product-moment technique, the more complicated computation which is
most used in test research.) The symbol ρ (the Greek letter rho) is used for
a rank-difference correlation coefficient. In Computing Guide 4, we show the
steps in determining $\rho_{1c}$, comparing test 1 with the criterion c.
When the computations for all three tests are performed, we have these
correlations between test and criterion:

$\rho_{1c} = .782$
$\rho_{2c} = -.091$
$\rho_{3c} = -.754$

A positive coefficient shows that high standing on the test goes with high
standing on the criterion. A negative coefficient shows that high standing on
the test goes with low standing on the criterion.


1. Begin with the pairs of scores to be studied.
   Example: Man A (x = 30; c = 25,000)
2. Rank men from 1 to N (number of men) in each set of scores. (Note that
   the lowest man must have rank N, unless he ties with someone.)
   Example: Man A has ranks 4, 6; N = 10
3. Subtract the rank in the right-hand column from the one in the left-hand
   column. This gives the difference D. (As a check, make sure that this
   column adds to zero.)
   Example: Man A: 4 - 6 = -2
4. Square each difference to get $D^2$.
   Example: Man A: $(-2)^2 = 4$
5. Sum this column to get $\sum D^2$.
   Example: $\sum D^2 = 36$
6. Apply the formula:
   $$\rho = 1 - \frac{6\left(\sum D^2\right)}{N(N^2 - 1)}$$
   Example:
   $$\rho_{1c} = 1 - \frac{6(36)}{10(100 - 1)} = .782$$

         Scores                 Ranks           Rank         Squared
                                              Difference    Difference
     Test   Criterion      Test   Criterion      (D)          (D^2)
      30     $25,000         4        6           -2            4
      34      38,000         2        2            0            0
      32      30,000         3        4           -1            1
      47      40,000         1        1            0            0
      20       7,000         9       10           -1            1
      24      10,000         7        9           -2            4
      27      22,000         5        7           -2            4
      25      35,000         6        3            3            9
      22      28,000         8        5            3            9
      16      12,000        10        8            2            4
                                        Sum        0           36

COMPUTING GUIDE 4. RANK-DIFFERENCE CORRELATION
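
The steps of Computing Guide 4 can also be sketched in a few lines of Python. This is only an illustration of the rank-difference formula, not part of the guide itself; the function names `ranks` and `rank_difference_correlation` are invented here, and the scores are those listed in the guide for test 1 and the criterion.

```python
def ranks(scores):
    """Rank from 1 (highest score) to N; tied scores share the halfway rank."""
    ordered = sorted(scores, reverse=True)
    result = []
    for s in scores:
        first = ordered.index(s) + 1            # best position held by this score
        count = ordered.count(s)                # how many people tie at this score
        result.append(first + (count - 1) / 2)  # halfway between positions occupied
    return result


def rank_difference_correlation(x, c):
    """rho = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), as in Computing Guide 4."""
    n = len(x)
    diffs = [rx - rc for rx, rc in zip(ranks(x), ranks(c))]
    assert abs(sum(diffs)) < 1e-9  # the guide's check: the D column adds to zero
    sum_d2 = sum(d * d for d in diffs)
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Salesmen A through J: test 1 scores and six-month sales (the criterion).
test1 = [30, 34, 32, 47, 20, 24, 27, 25, 22, 16]
criterion = [25000, 38000, 30000, 40000, 7000, 10000, 22000, 35000, 28000, 12000]

rho_1c = rank_difference_correlation(test1, criterion)
print(round(rho_1c, 3))  # 0.782
```

The same function can be applied to the other two tests once their scores are entered; where two men tie, `ranks` assigns each the halfway rank, as described for test 3.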
A zero coefficient means that one cannot predict the criterion from the test.
A correlation of 1.00 or -1.00 shows perfect relationship; when this occurs
the criterion score (or rank) can be predicted exactly. Test 1 identified the
best salesman and the second-best accurately, but the third-best salesman
ranked sixth on the test, which lowered the correlation. The larger the
correlation, whether positive or negative, the more accurate the prediction. On
test 3, a low score picks out a superior salesman. From these data we
conclude that either test 1 or test 3 is a good predictor for this firm.


11. Compute $\rho_{2c}$ and $\rho_{3c}$.
12. Obtain a combined score by subtracting score 3 from score 1 for each man.
Correlate this with the criterion. Would it improve prediction to use both tests?


Product-Moment Correlation. Although harder to learn, the product-moment
technique for computing correlation is easier to apply to large groups than
the rank method. The rank formula is equivalent to computing the
product-moment correlation between ranks.

A product-moment correlation (r) may be determined from a "scatter
diagram" obtained by plotting pairs of scores. The scores of Table 9 can be
put in this form. We set up a chart, with the first variable (test score) along
the horizontal axis and the second variable (sales) along the other (see
Figure 19). Man A is plotted above 30 on the x-axis (test 1), and opposite
25,000 on the y-axis (criterion). We can observe how criterion scores correspond
to test scores. As score 1 rises, c tends to rise.

FIG. 19. Scatter diagram for test 1 and criterion c.

At the end of the chapter is a computing guide showing how one
obtains a product-moment correlation from the scatter diagram. It is not
necessary to learn this procedure in order to interpret coefficients.
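
For readers who prefer to see the arithmetic, the following Python sketch computes a product-moment correlation directly from paired scores rather than from a scatter diagram. It uses the usual deviation-score formula, r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² · Σ(y − ȳ)²), which is assumed here and is not taken from the chapter's computing guide. The function name `product_moment` is invented for the illustration, and the data are the test 1 and criterion columns of Table 9. Applying the same function to the two columns of ranks illustrates the statement that the rank formula is equivalent to a product-moment correlation between ranks.

```python
from math import sqrt


def product_moment(x, y):
    """Product-moment correlation from deviations about the two means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Salesmen A through J, from Table 9.
test1 = [30, 34, 32, 47, 20, 24, 27, 25, 22, 16]
criterion = [25000, 38000, 30000, 40000, 7000, 10000, 22000, 35000, 28000, 12000]
test1_rank = [4, 2, 3, 1, 9, 7, 5, 6, 8, 10]
criterion_rank = [6, 2, 4, 1, 10, 9, 7, 3, 5, 8]

r_scores = product_moment(test1, criterion)           # correlation of raw scores
r_ranks = product_moment(test1_rank, criterion_rank)  # correlation of ranks

print(round(r_scores, 2))
print(round(r_ranks, 3))  # 0.782, the same value the rank-difference formula gives
```

The two coefficients need not agree exactly, since ranking discards the size of the gaps between scores.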

Any statistic has a certain variation from one sample to another. Even if
groups of subjects are drawn at random from the same population, the
correlation coefficients between two variables will differ from sample to sample.
Using a large sample of course makes the correlation more dependable. The
fluctuation of correlations from sample to sample may be considerable. If the
correlation of two scores in a large population is .30, in ten random samples
of 100 cases each the product-moment correlations would vary thus: .17,
.47, .34, .31, .24, .39, .20, .25, .28, .45. If samples are not random, but come
from different firms or communities, the fluctuation of coefficients will be
even greater.
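
The fluctuation described above can be reproduced by a small simulation. The sketch below is illustrative only: it assumes a population correlation of .30 (roughly the value around which the ten quoted coefficients cluster), normally distributed scores, and an arbitrary random seed. The helper names `product_moment` and `draw_sample` are invented for the example. The particular coefficients printed depend on the seed and will not match the ten quoted in the text.

```python
import random
from math import sqrt


def product_moment(x, y):
    """Product-moment correlation from deviations about the two means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


def draw_sample(rho, n, rng):
    """Draw n pairs of scores from a population whose correlation is rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        error = rng.gauss(0, 1)
        y = rho * x + sqrt(1 - rho ** 2) * error  # y shares part of its variance with x
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(1960)  # arbitrary seed, fixed only so the run is repeatable
sample_rs = [product_moment(*draw_sample(0.30, 100, rng)) for _ in range(10)]
print([round(r, 2) for r in sample_rs])
```

Raising the sample size from 100 to several thousand narrows the spread of the ten coefficients considerably, which is the sense in which a large sample makes the correlation more dependable.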


13. Prepare scatter diagrams relating tests 2 and 3 to the criterion.
14. How much would the rank correlation in Computing Guide 4 change if person
J had been replaced by a person with score 21 and criterion $26,000?


Meaning of Correlations. How well one variable predicts another is shown
by the scatter diagram. Figure 20 shows scatter diagrams corresponding to
various sizes of coefficient. When r = 1.00, one variable is predicted perfectly
from the other. With r = .60, prediction is only approximate. People who
stand at 8 on X average near 7 on Y, but they spread from 3 to 9. An
employer wishing not to lose any applicant whose Y score is 8 or better would
have to accept an X score of 4 or better. Prediction becomes
progressively poorer as the scatter diagram becomes "fatter."

FIG. 20. Scatter diagrams yielding correlations of various sizes.

Another way of considering the meaning of correlation is to translate the
scatter diagram into an expectancy chart. The expectancy charts shown in
Figure 10 (p. 78) correspond to test-criterion correlations of .51 for
mechanical aptitude, .47 for trade information, and .26 for the nut-and-bolt test.

When the correlation is less than 1.00, one measure is influenced by some
factor not found in the other measure. Random errors of measurement lower
a correlation. So do causal factors not involved equally in both variables. For
example, the correlation between intelligence and school marks is only
moderate because many factors besides mental ability influence the marks: pupil
effort, teacher bias, previous school learning, health, and so on.

It is incorrect to interpret high correlation as showing that one variable
"causes" the other. There are at least three possible explanations for a high
correlation between variables A and B. A may cause or influence the size of
B, B may cause A, or both A and B may be influenced by some common
factor or factors. The correlation between vocabulary and reading may be
taken as an example. Does good vocabulary cause one to be a good reader?
Possibly. Or does ability to read well cause one to acquire a good vocabulary?
An equally likely explanation. But to some extent both scores result from
high intelligence, a home in which books and serious conversation abound,
or superior teaching in the elementary schools. Only a theoretical
understanding of the processes involved, or controlled experiments, permits us to
state what causes underlie a particular correlation. Without this, the only
safe conclusion is that correlated measures are influenced by a common
factor.
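
The third explanation, a common factor, is easy to demonstrate with invented data. In the sketch below neither A nor B has any influence on the other; each is simply the sum of one shared factor and its own independent influence, all drawn from normal distributions. The variable names, the equal weighting of the shared and unique parts, and the seed are assumptions made for the illustration.

```python
import random
from math import sqrt


def product_moment(x, y):
    """Product-moment correlation from deviations about the two means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


rng = random.Random(7)  # arbitrary seed so the run is repeatable
a_scores, b_scores = [], []
for _ in range(5000):
    common = rng.gauss(0, 1)                   # e.g., a home where books abound
    a_scores.append(common + rng.gauss(0, 1))  # A = common factor + influence unique to A
    b_scores.append(common + rng.gauss(0, 1))  # B = common factor + influence unique to B

r_ab = product_moment(a_scores, b_scores)
print(round(r_ab, 2))  # close to .50, although A and B never act on each other
```

With half of each variable's variance coming from the shared factor, the expected correlation is .50. The coefficient by itself gives no hint that the relation arises in this way rather than through A causing B.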


15. How large a correlation would you anticipate between the following pairs of
variables?
a. Age and annual income of men aged 20 to 50.
b. Age in January, 1930, and age in March, 1950.
c. Scores on two intelligence tests, given the same week.
d. Annual income and number of children, among married urban men.
e. Maximum and minimum temperature in Wichita, each day for a year.
16. What is the expectancy of earning above average on Y, if a person has a score
of 8 on X? Determine this for each value of r in Figure 20.
17. What possible causal relations might underlie each of the following
correlations?
a. Between amount of education and annual income of adults (assume that r
is positive).
b. Between average intelligence of children and size of family (assume that
r is negative).
c. Between Sunday-school attendance and honesty of behavior (assume that
r is positive).
18. Beginning with the information in Figure 20, prepare an "expectancy table"
similar to Table 1, corresponding to each of the following values of r: 1.00,
.90, .40, .20.


Typical Validity Coefficients

Correlations between test and criterion are called validity coefficients.
Table 10 lists some fairly typical coefficients of predictive and concurrent
validity, taken, in each case, from the test manual. Some test-criterion
combinations yield much greater validity than others. The variation in results
for the Short Employment Tests should be particularly noted in Table 10. It
is very unusual for a validity coefficient to rise above .60, which is far from
perfect prediction.

Although we would like higher coefficients, any positive correlation
indicates that predictions from the test will be more accurate than guesses.
Whether a validity coefficient is high enough to warrant use of the test as a
predictor depends on such practical considerations as the urgency of
improved prediction, the cost of testing, and the cost and validity of the
selection methods already in use. To the question "What is a good validity
coefficient?" the only sensible answer is, "The best you can get." If a criterion can
be predicted only with validity .20, the test may still make an appreciable
practical contribution. Naturally, a greater contribution is required to justify
an expensive, inconvenient procedure than an inexpensive one.

TABLE 10. Illustrative Validity Coefficients

                                                                                     Type of      Validity
Test                          Sample                     Criterion                   Validity     Coefficient
California Short-Form Test    100 children referred to   Wechsler individual test    Concurrent   .7? for total score
  of Mental Maturity            a guidance department
Gordon Personal Profile       122 college students       Ratings of personality by   Concurrent   .49 to .?? for four
                                                           dormitory mates                          scores
Iowa Tests of Educational     634 students in six Iowa   Grade point averages as     Predictive   .58
  Development                   colleges tested in         college freshmen
                                grade 9
Short Employment Tests:       51 operators of proof      Supervisor's merit ratings  Concurrent
  Verbal                        machines in a large        of job performance                     .15
  Number                        bank                                                              .25
  Clerical                                                                                        .37
Short Employment Tests:       80 skilled operators of    Records of production on    Not stated
  Verbal                        bookkeeping machines       ten days                               .10
  Number                        in a bank                                                         .26
  Clerical                                                                                        .34
Short Employment Tests:       262 students in a one-     Satisfactory completion of  Predictive
  Verbal                        year secretarial           course vs. noncompletion               .15
  Number                        training course            or noncertification                    .4?
  Clerical                                                                                        .47
Short Employment Tests:       52 stenographers and       Ratings of job performance  Predictive
  Verbal                        clerks in an industrial                                           .45?
  Number                        concern                                                           .08?
  Clerical                                                                                        .3?

In interpreting any correlation coefficient, the range of the group studied
must be considered. The correlation is smaller in a select group than in a
group containing a wide range of ability. High-school achievement would
predict college marks with a validity much above .60 if all those with poor
records went on to college. The validity of the Iowa Tests for advising pupils
whether to plan on going to college is higher than their validity for selecting
among the high-school graduates who apply for college admission, because
the latter group is already restricted. (For further discussion of this topic,
see p. 351.)
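
The effect of range on a correlation can be shown with a simulation. The sketch below is hypothetical throughout: it assumes normally distributed scores, a correlation of .60 in the unselected group, and a "select group" made up of the top 30 per cent on the predictor. None of these figures comes from the Iowa data, and the helper name `product_moment` is invented for the example.

```python
import random
from math import sqrt


def product_moment(x, y):
    """Product-moment correlation from deviations about the two means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


rng = random.Random(351)  # arbitrary seed so the run is repeatable
rho, n = 0.60, 5000       # assumed correlation in the full range of ability

pairs = []
for _ in range(n):
    predictor = rng.gauss(0, 1)
    criterion = rho * predictor + sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    pairs.append((predictor, criterion))

# The select group: only the top 30 per cent on the predictor "go on to college".
pairs.sort(key=lambda p: p[0], reverse=True)
select = pairs[: n * 3 // 10]

r_full = product_moment([p[0] for p in pairs], [p[1] for p in pairs])
r_select = product_moment([p[0] for p in select], [p[1] for p in select])
print(round(r_full, 2), round(r_select, 2))  # the second coefficient is clearly smaller
```

Nothing about the test or the criterion differs between the two computations; only the range of the group has changed.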


Relation of Concurrent Validity to Predictive Validity. If a user needs a test
badly, he may want to employ it for prediction even before the evidence
on its predictive validity has been accumulated. Indeed, when a test such as
the Strong Vocational Interest Blank is invented to predict whether an
adolescent will enjoy a career as a physician, the criterion data cannot be
obtained for some twenty years. This would sometimes mean a very long delay
between the invention of a test and its practical use, if we insisted on waiting
for the predictive coefficient before applying the test predictively.

Concurrent validity can be determined at once, and may shed some light 
on the probable predictive validity of the test. Strong published his test in 
1928, offering as evidence of validity the fact that interest scores distin- 
guished men of different occupations from each other. For instance, the 
Physician scores of doctors averaged much higher than those of nondoctors. 
The purpose of the test is not to find out whether a man is at present a doc- 
tor; it is to find out if a young man will, as he grows older, be satisfied with 
that career. If the direction of a man’s interests at age 40 is the same as at 
20, then the concurrent validation based on older men does show that the 
test can safely be used to give vocational advice at age 20. Until long-term 
follow-up studies were made, users of the Strong test had to assume stability 
of interests. After publication of the test, Strong continued to accumulate
evidence by following adolescents for twenty years or more and by 1954 was 
able to verify that the test indeed predicts vocational status over a long 
period. 

A concurrent-validation procedure may be employed for almost any
predictive test, by administering it to persons whose criterion performance can
be observed immediately. An aptitude test for medical students may be
given at the time of graduation from medical school, grades being used as a
criterion. A test intended to identify potential neurotics may be evaluated
by determining whether it distinguishes present neurotic cases from some
nonneurotic group such as medical patients coming to the same clinic. Kohs
(1923, p. 182) offered a concurrent validation when he reported a correlation
of .80 between Block Design IQs and Stanford-Binet IQs at the time he
released his test. Almost never do we find research reports in which concurrent
and predictive validities are determined under the same conditions, so that
we cannot say just how much they are likely to differ. A reasonably close
comparison may be made between the following correlations of educational
proficiency tests with college grades in corresponding courses (Dressel and
Schmid, 1951):

                    Concurrent   Predictive
English                .61          .55
Social Studies         .79          .30
Science                .68          .4?

The concurrent correlations were obtained at Michigan State University and
the predictive correlations at Dartmouth, and it may be that the smaller
range of ability at Dartmouth accounts for part of the decline in the
validity coefficients.


19. Which of the following describes concurrent validity, and which describes
predictive validity? In which instances has concurrent validity been measured
even though the test appears to be intended for predictive purposes?
a. The Short Employment Tests are found to correlate .91 with the General
Clerical Test, which has been used for some time as a predictor of job
success.
b. The manual for the Henmon-Nelson Test of Mental Ability reports
correlations with school grades assigned one month later.
c. A correlation is calculated to determine how well a certain test distinguishes
patients diagnosed as schizophrenic from those diagnosed as brain-damaged.
d. School records of delinquents and nondelinquents in high school are
searched to learn what scores collected in the elementary grades correlate
with delinquent status.
20. A test of ability to understand spoken words is validated by administering it to
first-graders at the end of the year, and correlating it with their present
reading ability. The coefficient is fairly high. Would you expect a similarly high
predictive-validity coefficient
a. if the test is used at the start of the first grade to predict end-of-year
reading?
b. if the test is used at the end of the first grade to predict response of poor
readers to a special remedial reading program?
c. if the test is used with 4-year-olds, to predict later success in Grade 1
reading?
21. Would the attitude of present employees, taking a test in a concurrent
validation experiment, be the same as the attitude of applicants taking the
test?


Example of Validity Information. Let us now examine some of the data on
validity given in the DAT manual, for that version of the TMC. The data
offered include correlations with subsequent course grades, a four-year
follow-up of school achievement, and a follow-up of post-high-school careers.
Only a brief extract can be considered here.

Table 11 gives some of the coefficients relating mechanical comprehension
to course grades in science and shop. It is obvious that one cannot speak of
"the validity" of a test for a certain field, save as a shorthand description of
a general trend. The variation of coefficients is great, even from group to
group in the same school. There are many explanations for this: sampling
fluctuations, differences in course content, differences in reliability of
grading, differences in level of ability, etc. Ghiselli (1955, p. 112) reports similar
radical fluctuations among validity coefficients for tests of mechanical
comprehension against training criteria for repairmen in industry. He collected
over 100 coefficients from various studies, finding a range from -.80 to .60.
Eleven studies reported validity coefficients above .50, and fourteen found
coefficients below .20. This finding is not peculiar to the mechanical
reasoning test. Whenever numerous coefficients are obtained for the same test and
the same type of criterion, variation such as this is found. Only where
conditions are highly standardized is a second validity coefficient likely to
duplicate a first. Where training follows a uniform plan, where the level of ability
is held constant by selection, and where the criterion is based on objectively
measured performance, validity coefficients are as stable as the size of the
sample allows. Such a coefficient, however, may not confidently be assumed
to apply to another setting.

TABLE 11. Validity of the TMC as a Predictor of Course Grades for Boys

                                                 Time Between     Number     Validity
Course            Grade   Location               Test and Marks   of Cases   Coefficient
Industrial Arts     9     Mt. Vernon, N.Y.       1 year              67         .39
                    9     Worcester, Mass.       4 months            89         .20
                    9     Worcester, Mass.       1 year              79         .05
Woodworking        10     Independence, Mo.      3 months            42         .30
General Science     9     Mt. Vernon, N.Y.       1 year              84         .19
                    9     Columbia, Mo.          8 months            88         .50
Physics            1?     Schenectady, N.Y.      3½ years            42         .4?
                    ?     White Plains, N.Y.     1-2 years           41         .41

SOURCE: Bennett et al., 1959, pp. 44 ff.

In most predictive uses of tests, the published validity coefficient is no
more than a hint as to whether the test is relevant to the tester's decision. He
must validate the test in his own school or factory, and even then he can
expect coefficients to fluctuate. For this reason, testers are usually forced back
upon a psychological rather than a purely statistical use of scores.

While a test publisher may be expected to include representative validity
studies in the manual, much further evidence accumulates after the test is
distributed. Only a thorough search of professional journals can locate all
this information. The industrial psychologist can find many of the studies
relevant to his selection problems in the "Validity Information Exchange"
published quarterly in Personnel Psychology. One particular issue (Autumn,
1954), for example, presents thirteen different studies, among which three
use the Bennett test. We learn that among policemen in St. Louis, Du Bois
and Watson obtained correlations for Form BB of about .28 with training
grades and marksmanship, .20 with an achievement test, and .10 with rating
on duty. (No other test was better able to predict the rating.) Bruce found
no useful relation between Form AA and ratings of foremen in a tobacco
plant, and McCarthy a correlation of -.10 with ratings of foremen in a
factory which makes electrical equipment. Another useful source is the
Dorcus-Jones Handbook (1950), which abstracts 496 studies on employee selection
published prior to 1949. It is evident that the test manual can never
exhaustively summarize or integrate such a varied literature.
22. How can the negative correlation of Bennett scores and foremanship ratings be
explained?

23. The mechanical reasoning test yields coefficients as high as .50. With this
validity, what is the likelihood that a person who is above average on the test
will be above average on the criterion? (Use Figure 20.)

24. How do you account for the fact that the validity of the TMC as a predictor of
physics is not higher than .50?

25. What facts might the principal of the school in Schenectady obtain to
determine why the TMC predicted shop grades poorly, as compared to results in
some other schools? Should these facts be included in the test manual?

26. What factors account for the variation in validity coefficients for industrial
repairmen found by Ghiselli?
CONSTRUCT VALIDITY

Every test is to some degree impure, and very rarely does it measure exactly
what its name implies. Yet the test cannot be interpreted until we know
what factors determine scores. As Kent says (1937, pp. 429-423),

When a child of reading age is referred to the clinic because of his
failure to learn to read, it is of the first importance to ascertain whether
his mental capacity is or is not within normal limits. A composite test
which contains reading matter . . . discriminates against the subject
whose inability to read is due to any cause other than mental
retardation. A test which calls for oral response discriminates very seriously
against the child who by reason of speech defect or impediment is
unable to make himself understood. It is little more than a farce to use a
timed test or a test containing timed items for a psychotic subject whose
mental processes are pathologically slowed up. What we measure by
the test may be significant, but it is something quite other than what
the test is intended to measure.

Such items as Kent criticizes would probably correlate with criteria of
school success, and would probably be judged to have "content validity"
as a sample of significant adaptive performances. The difficulty is that the
test cannot be interpreted as a measure of a single psychological quality.

Construct validation is an analysis of the meaning of test scores in terms
of psychological concepts (Cronbach and Meehl, 1955).

Sometimes the tester starts with a test which he wishes to understand
better. Sometimes he starts with a concept for which he wishes a measuring
instrument. The interpretation of a test is built up very gradually, and
probably is never complete. As knowledge develops, we arrive at a more
complete listing of the influences that affect the test score, and may be able to
estimate the strength and character of each influence. At present, the
interpretation of even the best-established psychological tests falls far short of
the ideal.

While predictive validity is examined in a single experiment, construct 
validity is established through a long-continued interplay between observa- 
tion, reasoning, and imagination. First, perhaps, imagination suggests that 
Construct A accounts for the test performance. The investigator reasons, "If 
that is so, then people with a high score should have characteristic X." An 
experiment is performed, and if this expectation is confirmed, the interpreta- 
tion is supported. But as various deductions are tested, some of them prove 
to be inaccurate. The proposed interpretation must be altered either by in- 
voking a different concept, by introducing an additional concept, or by alter- 
ing the theory of the concept itself. The process of construct validation is the 
same as that by which scientific theories are developed (Spence, 1958).
Some constructs are “young” and not much theory has developed around 
them. Mechanical comprehension is an example. Older concepts (for exam- 
ple, intelligence and ego-strength) are imbedded in elaborate theories. 


There are three parts to construct validation:

• Suggesting what constructs might account for test performance. This is
an act of imagination based on observation or logical study of the test.

• Deriving testable hypotheses from the theory surrounding the construct.
This is a purely logical operation.

• Carrying out an empirical study to test this hypothesis.

The actual sequence of operations need not be so neatly ordered. Often one
has much experience with a test before offering an interpretation.
Sometimes the test is used for a long time before any theory is developed
around it. This was true, indeed, for the TMC.

What do we mean when we talk of explaining performance on the TMC?
Essentially, we mean being able to state what influences affect the score and
what influences do not. Once it was thought that there were three
"intelligences": verbal, mechanical, and social. To test the interpretation that the
TMC measures "mechanical intelligence," we would have to know what
this is. If it is said that mechanical intelligence is an inborn ability to
perform all tasks involving apparatus, we can begin research. We find that the
TMC correlates .68 with a pencil-and-paper test of reasoning with forms, but


only .08 to .39 with various dexterity tests. We are inclined, therefore, to
interpret the test as a measure of problem solving rather than of mechanical
performance.

Noting that boys do much better than girls, we become suspicious that
experience, and not only inborn ability, affects the score. Research on that
point shows that scores do increase after a course in physics, but only
by a small amount. Differences between persons are much the same before
and after the course.

It is necessary not merely to identify an influence but to find out how
strong it is. The writer once suspected that much of the TMC score
depended on knowledge of a few specific principles (e.g., gears, levers), each
of which was involved in several items. But when separate scores were
obtained on each type of item, these scores correlated highly. Since a person
high on gear problems was high on other items, it was unnecessary to
introduce the concept of specific subtypes of mechanical comprehension.
Specific knowledge could account for only a small amount of the differences
among persons.

It is already evident that no single type of research is used in construct
validation. We can give a brief tabulation of procedures merely to indicate
the diversity of methods and describe the relevance of each method to the
TMC.

• Examination of items. This is sufficient to rule out some explanations;
thus it is easily seen that neither arithmetic nor verbal reasoning affects
scores. But it is also seen that the machines used are those common in
Western culture, not in primitive Africa; this reminds us to consider cultural
background in interpreting the test outside industrial nations.

• Administration of test to individuals who "think aloud." This may show
that in some items quite irrelevant features of the test (e.g., an unclear
drawing) affect the score. It may show that some people succeed by an
intuitive perception of answers which others reach by painstaking work.
This would suggest that the score means different things for different persons.

• Correlation with practical criteria. Learning what courses or jobs the
TMC predicts clarifies what types of mechanical work it applies to.

• Correlation with other tests (and factor analysis). If the TMC
correlates highly with a general intelligence test, it need not be interpreted in
terms of a special mechanical aptitude. As a matter of fact, it does depend
to a substantial degree, but by no means entirely, on general mental ability.

• Internal correlations. The study of separate types of items described
above is of this type.

• Studies of group differences. The comparison of boys and girls is an
example.

• Studies of the effect of treatment on scores. Training in physics proved
not to affect the TMC greatly.

• Stability of scores on retest. If scores are unstable, one could not
interpret mechanical comprehension as a lasting, vocationally significant
aptitude. An obtained correlation of .69 between ninth- and twelfth-grade
scores for boys promises a reasonable degree of stability, but also shows
that this aptitude is far from a fixed quantity.

The user of the test wants to know how the test can be interpreted, and
how confidently. The manual should indicate what interpretation the author
advises, and should summarize the available evidence from all types of
studies relevant to this interpretation. If the user wishes to make some other
interpretation, he must examine all the evidence on the test in the light of
his own theory.


27. Kohs (1923, pp. 168 ff.) wished to argue that the Block Design test measured
"intelligence," defined as "ability to analyze and synthesize." He then offered
the following types of evidence (plus others) for his claim. How does each of
these bear on construct validity? (The Stanford-Binet test was at that time
recognized as the best available measure of intelligence but was thought
possibly to depend too heavily on verbal ability and school training.)

a. Logical analysis of the "mental processes" required by the items.

b. Increase in average score with each year of age.

c. Correlations as follows:
   Binet score with age                                   .80
   BD score with age                                      .66
   BD score with Binet score                              .81
d. Correlations:
   Binet score with teachers' estimates of intelligence   .47
   BD score with teachers' estimates of intelligence      .23
e. Correlations:
   Binet score with vocabulary
   BD score with vocabulary
f. Correlations between successive trials:
   on Binet, .91; on BD, .84

28. Which of the variables in Kohs' study are acceptable as criteria of pure
intelligence?


Suggested Readings


French, John W. Validation of new item types against four-year academic criteria.
J. educ. Psychol., 1958, 49, 67-76.
This predictive study compares different types of tests for college applicants
in terms of their power to predict grades and successful completion of
college work. The study is unusual because of the large number of tests
used, the large sample in each college, and the repetition of the experiment
in many colleges. Note particularly the degree to which results differ for
different criteria and different colleges.

Peak, Helen. Problems of objective observation. In Leon Festinger & Daniel Katz
(eds.), Research methods in the behavioral sciences. New York: Dryden, 1953.
Pp. 243-299.
This chapter, directed toward the social scientist choosing a measurement
procedure for a research project, discusses the qualities which make a
procedure satisfactory. Dr. Peak outlines many methods used in establishing
construct validity.

Thorndike, Robert L. The estimation of test validity: criteria of proficiency.
Personnel selection. New York: Wiley, 1949. Pp. 119-159.
Thorndike describes the various types of measures that may be used as
criteria, particularly in industrial applications of tests.

Validity. Technical recommendations for psychological tests and diagnostic
techniques. Washington: American Psychological Association, 1954. Pp. 13-28.
(Psychol. Bull., 1954, 51, Supplement.)
This section of the recommendations introduces the four types of validity,
recommends types of information about validity to be included in test
manuals, and gives specific examples of good and bad practice. It is suggested
that the reader compare the manual of some recent test with these
recommendations.


1. Begin with the pairs of raw scores to be studied.

   X   Y     X   Y     X
   24  35    27  38    26
   25  39    28  37    30
   24  39    29  36    32
   25  36    19  34    30
   31  43    28  37    25
   22  38    27  32    32
   30  43    25  38    26
   24  35    30  41    24
   25  40    31  41    21

2. Tabulate the points in a scatter diagram, entering one tally for each pair of
scores. (The first pair [24-35] is tabulated in the cell above 24 on the X scale,
and opposite 35 on the Y scale. This cell is outlined in the illustration.)

3. Count the number of tallies in each column, and write it below the diagram
in a row labeled f_x. Count the number in each row, and write it beside the
diagram in a column labeled f_y.

4. Select an arbitrary origin for X and for Y, and determine the mean and
standard deviation for each as in Computing Guide 2 (computation not shown).

   M_x = 26.2    M_y = 37.3
   s_x = 3.83    s_y = 3.72

5. In each cell of the scatter diagram, multiply the number of tallies by the
value of d_x written below that column, and write the product in the cell. (In the
outlined cell, for instance, there are two tallies, and d_x is -2; the product is -4.)

6. In each row, add the numbers written in the cells, and place in a column
labeled fd_x.

7. Multiply each entry in this column by d_y and enter in a column labeled fd_x d_y.

8. Add the column fd_x d_y.

9. Substitute the numbers in the following formula:

$$r_{xy} = \frac{\dfrac{473}{45} - .06}{(3.83)(3.72)} = \frac{10.51 - .06}{14.24} = \frac{10.45}{14.24} = .73$$


COMPUTING GUIDE 5. COMPUTING THE PRODUCT-MOMENT CORRELATION COEFFICIENT
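
The arithmetic of Computing Guide 5 can be checked with a short program. The following is only a sketch under stated assumptions: it works from ungrouped score pairs rather than a tallied scatter diagram, it computes standard deviations with N in the denominator, and the name `product_moment_r` and its `origin_x` and `origin_y` parameters are illustrative choices, not part of the guide. The sample data are the eighteen complete X, Y pairs legible in step 1, so the result is not expected to reproduce the coefficient the guide obtains from its full set of pairs.

```python
from math import sqrt


def product_moment_r(pairs, origin_x=0.0, origin_y=0.0):
    """Product-moment correlation from deviations about arbitrary origins.

    The arbitrary origins only simplify hand computation; the corrections
    cx and cy remove their effect, so any origin gives the same r.
    """
    n = len(pairs)
    dx = [x - origin_x for x, _ in pairs]
    dy = [y - origin_y for _, y in pairs]
    cx = sum(dx) / n  # mean deviation from the X origin
    cy = sum(dy) / n  # mean deviation from the Y origin
    sx = sqrt(sum(d * d for d in dx) / n - cx * cx)
    sy = sqrt(sum(d * d for d in dy) / n - cy * cy)
    mean_product = sum(a * b for a, b in zip(dx, dy)) / n
    return (mean_product - cx * cy) / (sx * sy)


# The eighteen complete pairs readable in step 1 (a subset of the guide's data).
pairs = [
    (24, 35), (27, 38), (25, 39), (28, 37), (24, 39), (29, 36),
    (25, 36), (19, 34), (31, 43), (28, 37), (22, 38), (27, 32),
    (30, 43), (25, 38), (24, 35), (30, 41), (25, 40), (31, 41),
]

r_from_zero = product_moment_r(pairs)
r_from_origin = product_moment_r(pairs, origin_x=26, origin_y=37)
print(round(r_from_zero, 2), round(r_from_origin, 2))
```

The two printed values agree, which is the point of the correction term: the choice of arbitrary origin changes the labor of computation but not the coefficient.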


How to Choose Tests 


THE first important consideration in choosing a test is its validity. As we
have seen in the preceding chapter, validity information permits us to judge
whether the test measures the right thing for our purposes. Validity is
examined by comparing scores to an external criterion, by comparing items to
a specified body of content, or by establishing an explanation of scores in
terms of general constructs. There are many additional qualities to consider
in choosing a test, some related to its statistical properties and some to its
practical features.


RELIABILITY 


Reliability studies give information about the consistency of a person's
scores on a series of measurements. For example, Bennett reports that for
the TMC for ninth-graders "the standard error of a score is 3.7." If one boy
were tested many times on a series of equivalent mechanical comprehension
tests, his scores would vary; the standard error is a calculated estimate
of the amount of this variation. It says that the series of raw scores for any
one boy would have a standard deviation of about 3.7. Since the standard
deviation of scores of different persons is 10.4 points, a standard error of 3.7
allows the boy's position within the group to shift over an appreciable range,
as Figure 21 shows. When we test the boy only once, he earns just one of
his many possible scores. We do not know in which part of his range we
happened to catch him.

Since scores vary from one trial to another, no one measure can be trusted
absolutely. The obtained score indicates only roughly the level of the person's
ability or typical behavior. The smaller the standard error, the more precisely
his level can be judged. Reliability information tells how much confidence
we can place in a measurement.
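
The meaning of a standard error of 3.7 can be made concrete by simulating the "many possible scores" of one boy. This is a sketch only: the true score of 40, the assumption that errors are normally distributed, and the function name `simulate_retests` are all illustrative and do not come from Bennett's report.

```python
import random
from statistics import pstdev


def simulate_retests(true_score, standard_error, trials, seed=0):
    """Scores one person might earn on many equivalent tests (normal errors assumed)."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, standard_error) for _ in range(trials)]


STANDARD_ERROR = 3.7   # reported for the TMC, ninth-graders
GROUP_SD = 10.4        # spread of scores among different persons

scores = simulate_retests(true_score=40.0, standard_error=STANDARD_ERROR, trials=10000)
spread = pstdev(scores)  # spread of this one boy's scores across trials

print("spread of one boy's scores:", round(spread, 1))
print("range of his scores:", round(min(scores)), "to", round(max(scores)))
print("standard error as a fraction of the group SD:",
      round(STANDARD_ERROR / GROUP_SD, 2))
```

A single testing corresponds to drawing one value from `scores`; the simulation shows how far that one value can lie from the boy's own average.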

Reliability always refers to consistency throughout a series of
measurements. There are various ways to observe such a series, for example, by
using the same test repeatedly or by using a series of "parallel" forms.
Different experimental procedures measure the effect of different types of
variation, and therefore different reliability coefficients mean different things.

FIG. 21. Variation in standing of a single person. (The figure compares the
distribution of scores for all persons with the scores of one individual on
many trials.)

Test scores vary over time. Attention and effort change from moment to
moment. Over longer periods, further shifts in score are created by physical
growth, learning, changes in health, and personality change. If we employ
different test items for each measurement, another type of variation is
introduced. The person who is lucky on one trial, finding items that are easy for
him, will encounter unfamiliar items on some other trial and earn a lower
score. To these variations must be added the unaccountable "chance" effects.
Chance effects enter even when we use the same procedure twice in rapid
succession; the two scores differ to some extent because of guessing,
instantaneous lapses of attention, and so on. Table 12 lists the sources of variation
in test scores.

A judgment that a student has completed a course or that a patient is
ready for release from therapy must not be seriously influenced by chance
errors, temporary variations in performance, or the tester's choice of
questions. An erroneous favorable decision may be irreversible. An erroneous
unfavorable decision, though reversible, is unjust, disrupts the person's morale,
and retards his development. Unless the tester and subject recognize how
fallible a measure is, they are likely to rely on it more than is justified.

Research likewise requires reliable measurement. In most experimental
designs a test of significance is used to learn whether an observed difference
results from the experimental treatment or could be accounted for by chance
variation. The larger the chance variation in the test employed, the harder
it is to find a significant difference between groups. Large error variance
masks scientifically important variations created by the experimental
conditions. Making a test more reliable improves the efficiency of an experiment
in the same way that increasing the number of subjects does.

TABLE 12. Possible Sources of Variance in a Test Score

I. Lasting and general characteristics of the individual
   1. General skills (e.g., reading)
   2. General ability to comprehend instructions, testwiseness, techniques of taking tests
   3. Ability to solve problems of the general type presented in this test
   4. Attitudes, emotional reactions, or habits generally operating in situations like the test
      situation (e.g., self-confidence)

II. Lasting and specific characteristics of the individual
   1. Knowledge and skills required by particular problems in the test
   2. Attitudes, emotional reactions, or habits related to particular test stimuli (e.g., fear of
      high places brought to mind by an inquiry about such fears on a personality test)

III. Temporary and general characteristics of the individual (systematically affecting
     performance on various tests at a particular time)
   1. Health, fatigue, and emotional strain
   2. Motivation, rapport with examiner
   3. Effects of heat, light, ventilation, etc.
   4. Level of practice on skills required by tests of this type
   5. Present attitudes, emotional reactions, or strength of habits (insofar as these are
      departures from the person's average or lasting characteristics; e.g., political attitudes
      during an election campaign)

IV. Temporary and specific characteristics of the individual
   1. Changes in fatigue or motivation developed by this particular test (e.g.,
      discouragement resulting from failure on a particular item)
   2. Fluctuations in attention, coördination, or standards of judgment
   3. Fluctuations in memory for particular facts
   4. Level of practice on skills or knowledge required by this particular test (e.g., effects of
      special coaching)
   5. Temporary emotional states, strength of habits, etc., related to particular test stimuli
      (e.g., a question calls to mind a recent bad dream)
   6. Luck in the selection of answers by "guessing"

Source: After R. L. Thorndike, 1949, p. 73.


In a test intended for predicting a definite criterion, reliability is less
important than predictive validity. If predictive validity is satisfactory, low
reliability does not discourage us from using the test. In comparing two
tests which measure the same thing, however, the more accurate test will
have the higher validity coefficient.


1. Locate each of the following sources of variance in Table 12:
   a. During a speeded test a student breaks his pencil and loses time while
      obtaining another.
   b. An industrial worker who has been in this country for a short time
      misunderstands an important phrase in the instructions for a performance test.
   c. A "hillbilly" is unable to answer correctly a question from an intelligence
      test about the purchase of a railroad ticket.
   d. A suspicious patient refuses to coöperate and gives perfunctory answers.
   e. A student guesses at every item of which he is uncertain.
2. Give an example of each source of variance in Table 12 which might affect
   performance on the Block Design test.
3. Which types of variation in Table 12 would lower the correlation between test
   and criterion in each of the following situations?
   a. A test of high-jumping ability is used to select finalists in a track meet. The
      criterion is performance in the meet, two weeks after the trials.
   b. A pencil-and-paper test of mechanical ability is used to predict performance
      of mechanical trainees. Piecework earnings after training are used as the
      criterion.


Interpretation of Coefficients


Reliability is usually expressed in terms of a "reliability coefficient," i.e.,
the correlation between two measurements obtained in the same manner, or
in terms of the standard error of measurement which we have already
described. Before considering the procedures used to estimate reliability, we
can discuss general principles applying to all such coefficients. The
principles are as follows:

• A reliability coefficient tells what proportion of the test variance is
nonerror variance.

• The reliability coefficient depends on the length of the test.

• The reliability coefficient depends on the spread of scores in the group
studied.

• A test may measure reliably at one level of ability and unreliably at
another level.

• The validity coefficient cannot exceed the square root of the reliability
coefficient.

Reliability and Error of Measurement. Variation between persons is
described by the standard deviation s, or by the score variance s². This
variation represents a combination of the differences that we wish to measure
(e.g., true ability in spelling) and the variation associated with a particular
test (e.g., the words chosen for the test, fatigue of some persons on the
day of testing). The true ability of any person would remain constant from
one measure to another, but obtained scores would vary to some extent.

This conception of "true score" assumes that we would really like to
determine the person's score on a very large sample of behavior. For
employment purposes, we would like to know what proportion of words the
stenographer will spell correctly during the next several years. We test
performance on only one day and on only one set of words; this is a small sample of
the total performance which we wish to estimate. In a school spelling test,
the teacher may wish to estimate performance on a particular assigned
list of words on a specific day. But this estimate should ideally cover the
pupil's "true" knowledge on that day, as observed on many, many trials. Any
one trial on a particular word is a small sample of his performance on that
word. The true score is the average score the person would obtain if the
performance were observed by a very long series of samples or trials (assuming
no practice effect from the testing). Error is defined as the variation or
fluctuation of the person's scores within the series. It is a sampling error arising


FIG. 22. Reliability and validity of the TMC as its length is changed.
(Horizontal axis: relative length of test.)


creeps up so slowly that lengthening a test beyond a certain point is
unprofitable. Beyond that point it is better to use the extra time to measure
some other aspect of behavior.

4. A spelling test of thirty words has a reliability coefficient of .80. What reliability
would be expected if ninety words were used?
5. Which kinds of variation (Table 12) are reduced in influence when a test is
lengthened?
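
The diminishing return from lengthening a test can be tabulated directly. The sketch below assumes that the relation between length and reliability referred to in the text is the Spearman-Brown formula; that formula does not appear in this excerpt, so its use here is an assumption, and the function name `spearman_brown` is illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made `length_factor` times as long.

    Assumes the added items are equivalent to the original ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# A thirty-word spelling test with reliability .80, lengthened by various factors.
for factor in (1, 2, 3, 5, 10):
    words = 30 * factor
    print(words, "words:", round(spearman_brown(0.80, factor), 3))
```

Each added block of thirty words raises the predicted coefficient by less than the block before it, which is the sense in which the curve "creeps up so slowly" at greater lengths.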


Relation of Reliability to Validity. An inaccurate test cannot be a good
predictor. There is a rule which states how reliability limits validity: the
correlation between the test and an independent criterion can never be
higher than the square root of the correlation between two forms of the
test. For example, if reliability is .64, validity cannot exceed .80. This
conclusion is derived from the formula relating test length to validity. Suppose
the test is so closely related to the criterion that the two would be perfectly
correlated if the test were free from error of measurement. That is, suppose
that when n becomes extremely large, both the validity and the reliability of
the lengthened test become 1.00. Then, according to the formula given above,
$r_{tc} = \sqrt{r_{11}}$.

Why is it that a test can correlate higher with a different measure than it
does with its own twin? To understand this, consider two short spelling tests
and a "criterion" based on exhaustive measurement over several weeks of
thousands of words. Each test score is much influenced by random errors of
sampling and guessing, but the criterion is not. The errors in the two tests
lower the correlation between them. Just one such set of errors affects the
test-criterion correlation, which therefore is higher than the test-test
correlation, which both sets of errors affect.
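
The spelling illustration can be imitated numerically. In this sketch the criterion is taken to be the true score itself, errors are assumed normal and independent, and the error standard deviation of 0.75 is chosen only so that the reliability comes out near .64; the names `pearson_r` and `validity_ceiling` are illustrative.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def validity_ceiling(reliability):
    """Upper limit on validity for a test of the given reliability."""
    return sqrt(reliability)


rng = random.Random(1)
N_PERSONS = 20000
ERROR_SD = 0.75  # true-score SD is 1, so reliability = 1 / (1 + 0.75**2) = .64

true_scores = [rng.gauss(0, 1) for _ in range(N_PERSONS)]
form_a = [t + rng.gauss(0, ERROR_SD) for t in true_scores]  # one short test
form_b = [t + rng.gauss(0, ERROR_SD) for t in true_scores]  # its "twin"
criterion = true_scores  # exhaustive measurement, assumed free of error

r_forms = pearson_r(form_a, form_b)         # both sets of errors lower this
r_criterion = pearson_r(form_a, criterion)  # only one set of errors lowers this

print("test with twin:", round(r_forms, 2))
print("test with criterion:", round(r_criterion, 2))
print("ceiling for reliability .64:", validity_ceiling(0.64))
```

The test-criterion correlation lands near the square root of the test-twin correlation, as the rule states for a criterion that would correlate perfectly with an error-free test.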


Reliability and Range of Scores. Errors of measurement are most
troublesome when the differences in ability or personality among the subjects are
small. When we are hiring one person out of a group of applicants, the final
decision often hinges on a difference of a few points between the best and
next-best man. Slight errors of measurement in such a case might result in
hiring the poorer man. If one is screening applicants for factory work
merely to rule out incompetents, however, a less-refined test will be
satisfactory. Errors of a few points cannot conceal the gross deficiencies in ability
which distinguish the "hopeless" from the average run of workers.

Assuming that error of measurement remains constant, we see from the
formula given above that $r_{11}$ decreases when the variance of true score
decreases.

A test which has satisfactory reliability for use with a wide-range group
may be unsatisfactory in a highly selected group. A rather crude mental test
can be used to identify which pupils entering school have mental
handicaps; but to divide the handicapped group, determining who is to be placed
in a special class, requires a much more accurate test. A criterion rating by a
supervisor may be adequately reliable for distinguishing failures from
acceptable men but is not so good for telling which men within the
satisfactory group are truly best.


6. The reliability of a test is .95 in a group for which s is 20. What will the
reliability be in a group where s is 10? (Compute $s_e$ for the wide-range group,
and use this value to compute the reliability for the second group.)
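
The two-step procedure suggested in exercise 6 can be written out as follows. This is a sketch that assumes the usual relation between the standard error of measurement and reliability, $s_e = s\sqrt{1 - r_{11}}$; the formula the text calls "given above" is not present in this excerpt, and the function names are illustrative.

```python
from math import sqrt


def standard_error(group_sd, reliability):
    """Standard error of measurement implied by a group's SD and reliability."""
    return group_sd * sqrt(1 - reliability)


def reliability_in_group(error_sd, group_sd):
    """Reliability in a group with the given SD, holding the error constant."""
    return 1 - (error_sd / group_sd) ** 2


se = standard_error(group_sd=20, reliability=0.95)         # wide-range group
r_narrow = reliability_in_group(error_sd=se, group_sd=10)  # selected group

print("standard error:", round(se, 2))
print("reliability where s is 10:", round(r_narrow, 2))
```

Halving the spread of the group while the error stays the same quadruples the share of variance that is error, so a coefficient of .95 falls to .80.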


Reliability at Different Score Levels. When a single reliability coefficient is
reported, we tend to assume that a test has the same accuracy for all types of
people. This assumption is often incorrect. Many tests are accurate only
at certain levels of performance. The Gates Reading Survey for grades
3-10, for example, gives reliable estimates of reading skill for pupils in the
higher grades. When third-graders take the test, they find it so difficult that
they do a great deal of guessing. As a result, individual differences in the
third grade are unreliably reported. Tests, no matter how reliable for other
groups, cannot measure accurately pupils whose scores are near the chance
level. Many tests are also inaccurate measures of individual differences in the
extremely high ranges of talent.

Figure 23 shows the scores of Navy recruits who took a pitch-discrimination
test twice. If the test were accurate, the two scores for each man would
be the same and all points would fall along the diagonal line. The test
consisted of 100 pairs of tones; in each pair the man reported whether the
second tone was higher or lower than the first. A score of 50 would be
obtained by pure chance. According to the scatter diagram, high scores are
fairly reliable. Men scoring 85 on the first test fell between 72 and 95 on
retest. But men scoring near the chance level (e.g., 55) scattered widely on
the retest (40 to 87). It is evident that the standard error of low scores is
great.

FIG. 23. Test and retest scores on pitch discrimination (Ford et al., 1944).
(Scatter diagram: score on first test on the horizontal axis, score on retest on
the vertical axis.)

The broken line shows the average score, on the second test, of men in
each column. The upcurve of this line at the left is especially interesting.
Many men with very low scores in the first test did well on the retest. A score
of 25 is too far below 50 to be a chance score. Probably men having such
low scores on the first test misunderstood directions and judged the first tone
instead of the second. Following directions correctly on the retest would
shift their scores from seventy items wrong to seventy items right.
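
The claim that a score of 25 is too low to be a chance score can be checked against the binomial distribution. The sketch assumes that a man who is purely guessing answers each of the 100 pairs independently with probability one-half of being right; the function name `prob_score_at_most` is illustrative.

```python
from math import comb, sqrt


def prob_score_at_most(score, n_items=100, p_correct=0.5):
    """Probability that pure guessing yields `score` or fewer correct answers."""
    return sum(
        comb(n_items, k) * p_correct ** k * (1 - p_correct) ** (n_items - k)
        for k in range(score + 1)
    )


chance_mean = 100 * 0.5               # expected score from guessing
chance_sd = sqrt(100 * 0.5 * 0.5)     # spread of guessing scores

print("chance mean and SD:", chance_mean, chance_sd)
print("P(score <= 40) by guessing:", prob_score_at_most(40))
print("P(score <= 25) by guessing:", prob_score_at_most(25))
```

Guessing scores cluster within a few standard deviations of 50, so a score of 25 is far outside what guessing produces, which supports the explanation that such men were answering systematically in the reverse direction.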
A test should be appropriate in difficulty for the decision to be made.
Figure 24 shows distributions of scores on several tests given the same group.
The very easy test A may be quite satisfactory for measuring differences at
the low end of the group. Test A is probably unreliable for the group as a
whole, since a variation of only a few points causes a subject to drop from the
top of the group to the average. Furthermore, it does not distinguish
between persons tying at 100, even though they probably are not equally
able.

FIG. 24. Distributions of scores for several tests given to the same group.

Test B is difficult. The distribution is skewed in the opposite direction
from that for test A; high scores spread out, but differences at the low end
of the scale are too small to distinguish individuals reliably. A normal
distribution, such as that for test C, spreads out cases at both ends of the scale.
In such a distribution, scores at the two ends are likely to be
equally reliable. For this reason, tests yielding roughly normal distributions
are preferred where it is necessary to distinguish equally well all along the
scale. If the decision requires us only to distinguish the best men, test B is
efficient. If we need only to eliminate the poorest men, we could use A.


7. Which distribution in Figure 24 would be most desirable in each of the following
situations?
   a. A psychologist wishes to measure liberalness of attitudes, to study its relation
      to voting habits.
   b. A college wishes to pick out freshmen needing special training in reading.
   c. A test for college guidance measures interest in medicine.
   d. An employer wishes to select the best statistician from a group of applicants.

8. The California Test of Personality, Elementary contains several subtests, one of
which is Feeling of Belonging. A low score on this questionnaire is said to
indicate maladjustment. According to the test manual, the percentile rank
corresponding to each possible score is as follows:

   Score        1 2 3 4 5 6 ...
   Percentile   ... 90

   a. How would a boy's standing in the group change if his score changed two
      points?
   b. What is the shape of the raw-score distribution? What does this distribution
      imply regarding the usefulness of the test?

9. In World Series baseball, some pinch hitters reach batting averages as high as
.750, whereas the best regular players rarely exceed .400 for seven games.
How can this be explained? What principle regarding reliability does it
illustrate?


Types of Coefficient 


A person's scores vary from time to time and from form to form of the test.
Some of these variations are regarded as a weakness in the measuring
procedure, i.e., as "error." But the meaning of "error" depends on the purpose
of testing. If a score is supposed to indicate a person's temporary condition
at the time of testing, it is desirable for scores to vary from moment to
moment. If the score is supposed to represent a lasting quality, moment-to-moment
variation is undesirable. Consider, for example, a test purporting to
measure rigidity of thinking. This might be used to predict success as a
scientist, to measure the level of adjustment of a patient during and after
therapy, or to measure the effect of a particular stress applied during an
experiment. Should variation of a person's score over time be regarded as true
variance or as error variance? Stability would be a great advantage in
predicting scientific success where we need to measure a lasting characteristic.
Since we want an estimate of behavior over a long period, instability from
occasion to occasion is error, for this use of the test. Likewise, we need a
stable measure of rigidity to judge a patient's status after therapy; there is
no point in knowing that he functions well today if he is likely to think
rigidly next week. On the other hand, if the therapist wants a week-to-week
barometer of the patient's temporary state, stability would be a
disadvantage. Likewise, to measure outcomes in a stress experiment, the test must be
sensitive to momentary states of the individual. Too stable an instrument
would be of no value for these two purposes.

For a comprehensive understanding of the test, we would like to know
what proportion of the variance can be ascribed to each of the four
categories of Table 12. We obtain such estimates by making two or more
measures of each person and then correlating the scores or performing a variance
analysis. Different experiments have to be made to measure each type of
variation. Figures 25 and 26 help to explain the various coefficients.
Each experiment treats some types of variation as "error"; in the
diagrams these are left unshaded, while the nonerror variance is shaded.

The first procedure to be considered is the retest correlation we f 
administering the same test on two occasions. This is called a coeff" Ge 
stability, because it tells us how stable this particular performance à 2 Je 
eral-lasting characteristics (e.g. in the TMC, general understanding, on 
vers) enter both measures. A person high in this ability tends to be 
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Dh idis V ipiis a. characteristics also affect both measures similarly; 
fils Bis ofa particular fact about air pressure on an airplane rudder will 
Piden m that particular item of both test and retest. This characteristic 

utes to variation between persons, not to variation between trials for 


em z 
porary Lasting 


General 


Specific 


Test correlated same test 


today with another day 


Retest Procedure 


FIG. 25. The coefficient of stability. Shading indicates the portion 


counted as true variance. 


the same person. Temporary factors (e.g., health or other casual variation)
may help an individual on one occasion and lower his score on another.
They therefore lower the test-retest consistency and are counted
as error in this type of reliability (Figure 25).

The coefficient of equivalence tells how well the test score agrees with
equivalent measures made at the same time. It is obtained by giving


FIG. 26. The coefficient of equivalence. Shading indicates the portion
counted as true variance. [Diagram: Temporary and Lasting columns by
General and Specific rows. Internal-consistency procedure: test today
correlated with similar test today.]


two equivalent forms (e.g., Form A and Form B of the TMC) in close
succession. The forms should be closely comparable, measuring the same
general attributes at the same approximate level of difficulty. As Figure 26
shows, general attributes affect both tests the same way. But since the tests
include different items, a specific attribute like knowledge about the
airplane rudder helps on one form but not the other. It therefore lowers the
correlation between forms. "Internal-consistency" procedures, which also
lead to a coefficient of equivalence, are discussed below.

A third procedure, less commonly used, introduces an appreciable delay 
between test and equivalent test. Now, both changes in the person and sub- 
stitution of new items lower the correlation and are included in the error 
variance. This coefficient reflects both the stability and the equivalence of 
the measures. 

If all three coefficients are obtained, we can determine how much of the
test variance is due to each type of variation, but such information is not
often available. For the Mechanical Reasoning Test of the DAT battery,
however, these correlations are reported (in various sources):

    Form A with Form B, immediate               .85
    Retest after three years, Form A            .73
    Form A with Form B, three-year interval     .65

These facts permit us to construct Figure 27 by subtracting the various
estimates from 1.00 and from each other. As Figure 27 shows, only
lasting-general components count as true variance in the delayed
parallel-test correlation; therefore, 65 percent of the test variance is
lasting and general. In the immediate between-forms correlation,
lasting-general and temporary-general variance are both counted as nonerror.
By subtraction, the temporary-general variance must be 85 percent less
65 percent, or 20 percent. By further subtraction, we find that the
temporary-specific and lasting-specific components account for 7 percent
and 8 percent, respectively. We see that most of the score variance is due
to general abilities and habits rather than to information specific to
particular items. Quite a large proportion of the variance is due to
characteristics which remain stable over time.

FIG. 27. Distribution of variance in the DAT Mechanical Reasoning Test.

    Total variance                                        100%
    Delayed parallel-test correlation .65, hence
        "True" (lasting-general variance)                  65%
        "Error" (TS + TG + LS)                             35%
    Immediate parallel-test correlation .85, hence
        "True" (LG + TG)                                   85%
        "Error" (TS + LS)                                  15%
        Subtracting, temporary-general variance            20%
    Delayed retest correlation .73, hence
        "True" (LG + LS)                                   73%
        "Error" (TG + TS)                                  27%
        Subtracting, lasting-specific variance              8%
    Subtracting, temporary-specific variance                7%

    Diagram cells: Temporary-General 20%, Temporary-Specific 7%,
    Lasting-General 65%, Lasting-Specific 8%.
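
The subtraction used for Figure 27 can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name and the dictionary keys are our own labels, and the sketch assumes that each coefficient can be read directly as the share of variance counted as "true" in that experiment.

```python
def decompose_variance(delayed_parallel, immediate_parallel, delayed_retest):
    """Split total score variance (taken as 1.00) into four components.

    Assumption for this sketch: each coefficient equals the proportion of
    variance treated as nonerror in the corresponding experiment.
    """
    # Delayed parallel forms: only lasting-general (LG) variance is "true".
    lasting_general = delayed_parallel
    # Immediate parallel forms count LG + TG as "true"; subtract LG.
    temporary_general = immediate_parallel - delayed_parallel
    # Delayed retest counts LG + LS as "true"; subtract LG.
    lasting_specific = delayed_retest - delayed_parallel
    # Whatever remains of the total is temporary-specific (TS).
    temporary_specific = 1.0 - (
        lasting_general + temporary_general + lasting_specific
    )
    return {
        "lasting_general": lasting_general,
        "temporary_general": temporary_general,
        "lasting_specific": lasting_specific,
        "temporary_specific": temporary_specific,
    }


# The three DAT Mechanical Reasoning coefficients quoted in the text.
parts = decompose_variance(0.65, 0.85, 0.73)
for name, share in parts.items():
    print(f"{name}: {share:.2f}")
```

A negative component would signal that the three coefficients are not mutually consistent, for instance because they come from different groups.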


10. Prepare a diagram resembling Figure 26 for the delayed equivalent-forms
coefficient.

11. A teacher gives a standardized test of knowledge of scientific facts to his class
in chemistry. Several students make scores lower than he had expected.

a. He asks, "Could it be that I gave a form of the test which included many
questions these particular pupils happened not to know? Would their scores
have changed much if they had been asked other questions of the same
type?" What reliability coefficient answers this question?

b. He asks, "Could the performance of these students be due to the fact that
they were having an 'off' day? Does a pupil's score on tests of this type
vary much from day to day?" Which coefficient is most helpful in answering
this question?

12. Which types of variance are to be regarded as "error" when a questionnaire
regarding emotional problems is used for this purpose:
a. To select high-school pupils with whom the counselor should have an early
conference.
b. To identify recruits likely to break down in service.
c. To identify the area within which a pupil has conflicts, as a preliminary to
a counseling interview.

13. One favorite method of estimating reliability is to split a test in two parts,
score them separately, and correlate. This correlation between half-tests is
converted by the Spearman-Brown formula to obtain a coefficient for the full test.
Show that this is properly considered a "coefficient of equivalence" even
when only one form is used.

14. What can you say about the variance make-up of a test, knowing only that a
delayed retest gives a coefficient of .70, and an immediate parallel test gives
a coefficient of .80?

15. Given these facts about a test measuring "liberality" of political attitudes,
prepare a diagram similar to Figure 27.

    Between-forms correlation at same sitting      .90
    Between-forms correlation, one year apart      .60
    Retest correlation, one year apart             .65

16. Speaking about hearing tests for children a writer says: "Physical and
psychological changes from day to day may make tests at two sittings less valid
than a complete test at one sitting. We find that we get worse results on cloudy
days than on sunny days."
In what sense is the word valid used? Can you defend the contrary statement
that scores at two sittings would be more valid than a complete test at one
sitting?


Coefficients of Stability

Retest coefficients might be obtained with any interval between tests from
minutes to several years. If the two tests are close together, the person
will remember some of his former answers. This carry-over makes the retest
correlation a little higher than the correlation between two independent
measures.

There is no single coefficient of stability for a test; rather, there is one for
each time interval. The longer the time between tests, the lower the coefficient
of stability, as we shall see when we study results for the Stanford-Binet
test. For 7-year-olds, the immediate retest correlation is about .90;
it declines steadily, so that after four years the retest correlation is only
[...] and after eleven years it is only .68 (see p. 176). From this we can
judge that at least 22 percent of the test variance at age 7 (.90 less .68)
results from individual differences which are accurately measurable at the
present moment but will be altered by time and experience.

TABLE 13. Illustrative Reliability Coefficients

Test                      Sample                    Procedure                 Type of        Coefficient
                                                                              reliability
School and College        370 high-school           Kuder-Richardson          Equivalence    .93 (verbal)
  Ability Tests             seniors                   formula 20                             [...] (quantitative)
                                                                                             .95 (total)
Pintner General Ability   203 12-year-old boys      Odd-even correlation      Equivalence    .86
  Test, Non-Language,       in two communities
  Intermediate, Form K
Short Employment Tests    About 230 candidates      Testing with parallel     Equivalence    .87 to .92 for
                            for nursing school        forms on same occasion                 three sections
Short Employment Tests    72 machine operators      Retest after two years    Stability      .71 to .84 for
                            in a bank                                                        three sections
Allport-Vernon Study      48 persons (not           Correlation between       Equivalence    .49 to .84 for
  of Values                 otherwise described)      comparable half-tests                  six scores
Allport-Vernon Study      Not described             Retest after three        Stability      .39 to .84 for
  of Values                                           months                                 six scores

The tester must interpret information on stability in the light of his
purposes. If he intends to make long-range predictions or to measure a trait
which is supposed to be constant, he wants stability over long periods. For
other uses of tests, stability over a long time is of little importance.


Coefficients of Equivalence 


The tester usually wants to know the person's standing on some general
quality of which the test items are representative. Very rarely are the
qualities to be measured so specific that they must be measured by just one
set of items. In the TMC, for example, the aim is to measure the person's
ability to solve virtually any mechanical problem. If scores depended very
much on content of specific items, the test would be an unsatisfactory
predictor except for criteria to which this specific knowledge is relevant. The
correlation between equivalent forms is therefore likely to be the most useful
index of test reliability.
Where only one form of a test can be given, an internal-consistency
procedure is used as a substitute for a between-forms coefficient. In the
split-half method, the test is given in the usual fashion but then is scored
in two parts. It is necessary that the two halves be independent, so that
success on an item in one half does not help with an item in the other half.
Correlating the two parts gives a coefficient of equivalence for the
half-tests, and the Spearman-Brown formula is often used (with n = 2) to
obtain the coefficient for the full test. A better estimate is obtained by
the formula

\[ r = 2\left(1 - \frac{s_a^{2} + s_b^{2}}{s_t^{2}}\right) \]

where s_a and s_b are the standard deviations of the half-tests and s_t is
the standard deviation of the full test. The two formulas give very nearly
the same result in most studies. The coefficient obtained is an estimate of
the coefficient of equivalence between two full-length tests which are as
similar as the two halves are.
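
The two split-half estimates can be compared with a short Python sketch. The Spearman-Brown step is the standard formula n·r / (1 + (n − 1)·r). The variance-based estimate is assumed here to take the form 2(1 − (s_a² + s_b²)/s_t²), with s_t the standard deviation of the full-test scores; that form is an assumption of this sketch, as are the function names and the illustrative half-test scores.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pvariance(x) ** 0.5 * pvariance(y) ** 0.5)


def spearman_brown(r_half, n=2):
    """Project a part-test correlation to a test n times as long."""
    return n * r_half / (1 + (n - 1) * r_half)


def variance_split_half(half_a, half_b):
    """Variance-based split-half estimate (assumed form, see lead-in)."""
    totals = [a + b for a, b in zip(half_a, half_b)]
    half_variances = pvariance(half_a) + pvariance(half_b)
    return 2 * (1 - half_variances / pvariance(totals))


# Hypothetical half-test scores for four persons.
half_a = [1, 2, 3, 4]
half_b = [2, 1, 4, 3]

r_half = pearson_r(half_a, half_b)
print("half-test correlation:", round(r_half, 3))
print("Spearman-Brown (n = 2):", round(spearman_brown(r_half), 3))
print("variance-based estimate:", round(variance_split_half(half_a, half_b), 3))
```

When the two halves have equal standard deviations, as in this made-up example, the two estimates coincide; they drift apart only as the half-test variances become unequal.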
Two internal-consistency formulas developed by Kuder and Richardson
are often used to obtain coefficients of equivalence for tests where one point
is given for every correct answer and zero for a wrong answer.
Kuder-Richardson "Formula 20" gives a coefficient for any test which is
equal to the average of all the possible split-half coefficients. The KR20
coefficient may generally be taken as a good approximation to an
equivalent-form correlation. Coefficient KR20 is computed from the
proportion passing each item, and from the standard deviation of scores
(Guilford, 1956, pp. 454-455).
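
A minimal Python sketch of KR20 follows. It uses the usual textbook form, r = (k/(k − 1))(1 − Σpq / s²), where p is the proportion passing an item, q = 1 − p, k is the number of items, and s² is the variance of total scores. The function name and the tiny data set are illustrative assumptions, and the population variance is used throughout so that item and total variances are on the same footing.

```python
from statistics import pvariance


def kr20(item_matrix):
    """KR20 for right/wrong items.

    item_matrix: one row per person, each row a list of 0/1 item scores.
    """
    n_persons = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    total_variance = pvariance(totals)

    # Sum of item variances, p * q, over all items.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n_persons
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / total_variance)


# Hypothetical answers of four persons to three items (1 = correct).
answers = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
print("KR20:", kr20(answers))
```

The sketch fails with a division by zero if every person earns the same total score, which is reasonable: no reliability can be estimated where there are no individual differences.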


The second formula, KR21, is less accurate but very simple to compute.
The formula can be used by any tester to get a quick estimate of the
coefficient of equivalence in his group, if his test is scored by the
number-right formula. The quantities used in the formula are the mean (M),
the number of items (k), and the standard deviation (s).

\[ r = \frac{k}{k-1}\left(1 - \frac{M(k-M)}{k s^{2}}\right) \]

For most tests this formula will give very nearly the same result as KR20,
but sometimes it gives a much lower coefficient. When the two estimates
differ by a large amount, the decision as to which coefficient is most relevant
depends upon technical considerations which we cannot treat here.

Internal-consistency procedures cannot be used with speeded (time-limit)
tests, because the parts of the test are not independent. A person who
gets stuck on an early item and spends extra time on it will fail to reach
the items at the end of the test. The correlation of items within a trial is
therefore higher than the correlation between parts timed separately, and
the split-half or Kuder-Richardson reliability is spuriously high. The Primary
Mental Abilities Tests for Ages 11 to 17 are given with short time limits,
yet the manual reports only split-half reliabilities. Anastasi and Drake
(1954) administered the half-tests with separate time limits in order to get
a proper estimate of reliability, and compared these with reliabilities
computed by the spurious single-administration method. The results are as
follows for the four PMA tests:


                Separately timed halves     Single administration
    Verbal               .90                        .94
    Reasoning            .87                        .92
    Space                .75                        .90
    Number               .83                        .92


It is obvious that the split-half estimates from the single administration are 
inflated and give too favorable an impression of the test. 
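
The quick KR21 estimate mentioned earlier in this section needs only three summary figures, so it is easily sketched in Python. The sketch assumes the standard form of the formula, r = (k/(k − 1))(1 − M(k − M)/(k s²)); the function name and the example figures are hypothetical and are chosen only to show the arithmetic.

```python
def kr21(k, mean_score, sd):
    """KR21 estimate from the number of items (k), the mean, and the s.d.

    Assumes a number-right score on k items scored 1 or 0.
    """
    if k < 2 or sd <= 0:
        raise ValueError("need at least two items and a positive s.d.")
    return (k / (k - 1)) * (1 - mean_score * (k - mean_score) / (k * sd ** 2))


# Hypothetical test: 50 items, mean 40, standard deviation 5.
print("KR21:", round(kr21(50, 40, 5), 3))
```

Because KR21 treats all items as equally difficult, it should be read as a rough lower-side estimate when item difficulties vary widely, which is consistent with the remark that it sometimes gives a much lower coefficient than KR20.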


17. A classroom teacher gives a forty-item test with mean 34 and s.d. 3. What is
the KR21 reliability?


APPEAL TO THE LAYMAN 


When a patient loses faith in the medicine his doctor prescribes, it loses
much of its power to improve his health. He may skip doses, and
he may decide doctors cannot help him and let treatment lapse.
For similar reasons, in selecting a test one must consider how worth while it
will appear to the subject who takes it and to other laymen who will see the
results.

If an applicant for a job is given an employment test which he considers
silly or unrelated to the job, he is likely to be resentful. This will make it
difficult to obtain valid scores. If he is not hired, he may excuse his failure
by criticizing the test; what he says to his friends damages public relations
and makes it harder to obtain job applicants. Even the successful man may
feel that he was hired in spite of the test, and begin work with antagonism
toward management. Some satisfactory workers have had little schooling and
are distrustful of tests which probe their weaknesses; catch questions and
questions which seem childish are especially likely to arouse criticism.

If a test is interesting and "sensible," taking it is likely to be a pleasant
experience. This not only tends to make the scores valid but also helps to
establish good relations between the personnel worker and the subject. An
Italian bus company contracted with psychological laboratories in two cities
to give tests to applicants for jobs as drivers. After a few months, it was
found that most of the applicants were traveling to Rome (going 25 or
100 miles farther than necessary) because the Rome center had elaborate
testing apparatus while the second center used simple equipment to
measure the same aptitudes. The applicants thought the elaborate tests
fairer and more dependable.
British experience with War Office selection boards is a second case in
point. The selection board observes candidates during several days of field
testing, apparatus tests, discussions, etc. Before this system was established,
men from the ranks rarely applied for commissions because they thought the
procedure formerly used gave an advantage to applicants from good homes and
good schools. They think the selection board is a fairer system where a man
can show his true ability, and this attitude has been of great assistance in
recruiting of officers, so much so that, although the board system is costly,
and possibly less valid than objective tests could be, there is no thought of
changing it.

The subject is not the only one who must be satisfied with the
psychologist's tests. The British selection program has had to satisfy a Labor
cabinet anxious that poor boys have a fair chance to become officers, the
parents of the boys tested, and the old-line officers who train the accepted
men. A psychologist who installs a highly valid industrial selection program
will find it in the ashcan a year later unless he convinces both management
and the union that the test is fair. Users of test results have strong
prejudices. If a group of social workers is accustomed to mental test A, the
psychologist who decides to substitute mental test B will encounter
difficulty. Even if test B is more accurate than A, the social worker may
disregard results from B because it does not have his confidence. So
important is user acceptability that the psychologist working with teachers,
industrial personnel men, or physicians must often use a test which would be
his second or third choice on the basis of technical qualities alone.

A test which looks good for a particular purpose is said to have "face
validity." Adopting a test just because it appears reasonable is contrary to
scientific practice; many a "good-looking" test has failed as a predictor.
Civil Service examiners, for example, prepared two tests to measure ability
in alphabetic filing. One gave five names per item (John Meeder, James
Medway, [...] Madow, Catherine Meagan, Eleanor Meehan) and asked which
name would be third in alphabetic order. The other test required the
subject to place a name in the proper place in a series:

    Robert Carstens          [...] Carreton
                             Roland Casstar
                             Jack Corson
                             Edward Cranston

Though the makers were confident that the tests were representative of the
same skill, and though both tests had reliabilities above .80, they correlated
.01 (Mosier, 1947).

Such evidence as this (reinforced by the whole history of phrenology,
graphology, and tests of witchcraft!) is strong warning against adopting a
test solely because it is plausible. If one must choose between a test with
"face validity" and no technically verified validity and one with technical
validity and no appeal to the layman, he had better choose the latter. The
job of the tester is, after all, to get information which improves decisions.
The tester should seek and usually can find a test which has both face
validity and technical validity.


18. If a clinical tester is examining a criminal to establish whether he is mentally
responsible, he may have to present his results in court and stand
cross-examination on them (l. Frank, 1956). In what ways might his choice of tests
differ from those he would use in examining a similar case at the request of a
hospital psychiatrist?

19. A certain examination for French secondary-school admission was deliberately
made very difficult to obtain a skewed distribution, since only a small number
of places was to be filled. When the children told of the questions at home,
parents organized protest meetings which ultimately brought the problem to
the attention of the Minister of Education, who decided to give a second test
to those who had failed. Do you agree with this decision?

PRACTICAL CONSIDERATIONS 


Ease of Application 
In almost any field, one can choose between tests to be administered by
untrained persons and tests which can be given only by an expert. The test
which is simpler to apply will have more complete directions and simple
objective scoring, and requires no observation or judgment by the tester. The
more complex test offers more comprehensive findings, but only in the hands
of a well-qualified tester. Attention should also be paid to how well
the manual assists the user in drawing conclusions from test results.
This is especially important when a psychologist is choosing a test
whose results many other persons will consult.

A test manual may present all the important information about the test
and yet fail to communicate to the reader. Lennon found, indeed, that
large numbers of schoolteachers fail to grasp even simple factual statements
in an achievement test manual (Lennon, 1954, pp. 90-94). The implication
is that a person in charge of a testing program must make an effort to
educate all those who will give or interpret tests, rather than to rely on the
manual to convey the insights they need.

Equivalent or Comparable Forms

We have noted, in connection with reliability, that equivalent forms are
often available. Equivalent or parallel forms are tests measuring the same
thing at the same level of difficulty, so that equal raw scores have the same
meaning on each form. They are especially valuable when each person is
tested twice, for instance before and after therapy. The use of new
questions rules out the effect of memory. An equivalent form is useful also for
checking a dubious score. A second test would be given, for example, when
the tester suspects that emotional disturbance spoiled the first test.

Two tests are said to be comparable when their raw scores can be
converted to the same derived-score scale. Some school achievement tests are
organized in a series at different levels of difficulty so that the pupil may be
tested each year. Although the tests are not equivalent, a scale is provided
so that performances on the easier test and the harder test can be compared
to determine the pupil's gain. Another type of comparability is seen in the
DAT profile chart, which permits comparison of mechanical aptitude with
other abilities. Such comparisons, based on the same norm group, have an
obvious value for interpretation.


Time Required 


Time available for testing is always limited, and therefore short tests are
preferred, other things being equal. Too long a testing period bores the
subject and makes him uncoöperative. Where morale is high, however, one can
use very long testing batteries successfully. We have already seen that
reliability, and to a lesser extent validity, depends on test length.
Shortening tests to a few items will destroy their value, but not much is
gained by lengthening a test to get a more accurate score, except in
competitive examinations where a small difference in score makes a great
difference. The Bennett TMC has sixty items and requires about thirty
minutes for most adolescent subjects, a usual length for a test.


Multiscore vs. Single-Score Tests

A test or battery yielding several scores has to be longer than a
single-score test, if its scores are to be reliable. It is difficult, however,
to state whether a multiscore test is superior to a single-score test
occupying the same time. The single score is likely to be much more reliable
than the several scores of the multiscore test. The tester who needs several
facts about the individual may prefer to obtain somewhat unreliable answers
to all these questions rather than to measure one dimension precisely and
remain without information on the others. A high-school counselor would
obviously prefer ten-minute tests of five specialized abilities to a
fifty-minute test which measures mechanical reasoning very precisely and
tells nothing about the other four. He probably should not go so far as to
substitute ten three-minute tests for the thirty-minute TMC, but it is hard
to define a perfect balance between breadth of coverage and precision.

When many decisions are to be made, each requiring a different sort of
information, the best solution is to allow a large amount of time for gathering
information. There are limits to this, however. In clinical diagnosis of
children with behavior problems, for example, one could think of hundreds of
tests and observations which might shed light on different aspects of
development. An employer hiring an executive can likewise raise a very large
number of questions. While no general rule can be given as to the best
division of limited testing time, it is clear that the greatest amount of time
should be given to the most important questions. Where there are several
questions of about equal importance, it is definitely more profitable to use a
test giving a rough answer to each one than to use a precise test which
answers only one or two questions (Cronbach and Gleser, 1957).

The disadvantage of quick, crude measures disappears when we make
them a first step in a sequential measuring program. In hiring employees,
for example, the very poor prospects can be weeded out by a rather
inaccurate pencil-and-paper test, and sometimes those who score very well on
such a test can be hired at once. Then only the applicants near the borderline
need be given an accurate and more costly test. In testing an individual for
guidance or diagnosis, we can begin with a multiscore test covering many
variables (a battery of short aptitude tests, or an interest measure covering
all fields). A further, more precise test can then be used for any variable
(e.g., clerical aptitude, or interest in speaking activities) which looks
important on the basis of the first results.

Cost 


The cost of the usual test is only a few cents, but when one is testing a
large number of persons, a difference in cost may be worth some attention.
Fortunately there is little relation between the cost of tests and their
quality, so that even a limited budget permits the use of well-constructed
tests. Cost is greatly reduced where it is possible to use an answer sheet and
a reusable question booklet. The reusable TMC booklet costs about 18 cents
per copy (in packages of 25). The answer sheets cost an additional 4 cents. In
determining the cost of a test, one must consider not only the cost of the
materials but also the cost of scoring.

A fairly representative figure on costs of testing is suggested by the
charges of a nonprofit test scoring and rental service operated by a state
university for the high schools it serves (Unit on Evaluation, 1953). Costs
include shipping, handling, materials, scoring labor, and other items. A school
can rent the typical test booklet for 3 cents and purchase the answer sheet
for about 3 cents. The scoring charges vary with the number of scores
obtained for each sheet, from about 4 cents per pupil up to 15 cents. This
means that guidance testing for sixty pupils, measuring five aptitude and
achievement areas, would cost the school under $30, including all costs
except teacher time for administering the tests and time used to interpret
results to students. Since the decisions for which the tests are used make a
great difference in the educational efficiency of the school and the soundness
of the pupil's life plans, cost of testing should be given very little weight
in choice of tests.


EVALUATING A TEST 


We have now introduced nearly every concept that is used in judging the
adequacy of a test. Subsequent chapters will describe the various types of
tests and apply these concepts [...] more completely. Even though [...]
attempting to draw conclusions about particular [...], we summarize the
concepts and present a form useful in evaluating any test.

The development of a testing program requires, first of all, a clear
purpose. As we pointed out earlier, one must search for a test suited to the
decision to be made, not just for "a good test of reading" or "a good [...]
test." It is unrealistic for the student of testing to evaluate a test in the
abstract, yet one cannot consider all possible applications simultaneously. For
this reason, it is suggested that any test manual be approached with a
definite measurement problem in mind. Our form carries a space for entering
this purpose, which might be specific (selecting girls for training as
punch-card clerks) or rather general (obtaining information for subsequent use
in counseling high-school pupils as problems arise).

Ordinarily the tester's situation restricts the type of test that may be
considered. It determines the choice between group and individual tests, the
age or ability range, and the level of interpretative skill to be required.
Thus a [...] individual testing [...] normal population [...], and will have
to be suitable for interpretation by counselors and probably by teachers and
administrators. With such crude specifications in mind, one turns to
publishers' catalogs, the Buros Yearbooks, texts on measurement or applied
psychology, etc., and makes a list of tests to consider. The form presented in
Table 14 is a convenient record of facts and opinions regarding tests examined
in detail. We have numbered the entries to make it easier to connect our
discussion with various entries.

The top section (entries 1-9) includes the simple descriptive facts which


TABLE 14. A Form for Evaluating Tests

 1. Title
 2. Author
 3. Publisher
 4. Forms and groups to which applicable
 5. Practical features
 6. General type
 7. Date of publication
 8. Cost, booklet; answer sheet
 9. Time required

10. Purpose for which evaluated

11. Description of test, items, scoring
12. Author's purpose and basis for selecting items
13. Adequacy of directions; training required to administer
14. Mental functions or traits represented in each score
15. Comments regarding design of test

16. Predictive validation (criterion, number and type of cases, result)
17. Concurrent validation (criterion, number and type of cases, result)
18. Other empirical evidence indicating what the test measures
19. Comments regarding validity for particular purposes

20. Equivalence of forms or internal consistency (procedure, cases, result)
21. Stability over time (procedure, time interval, cases, result)
22. Norms (type of norms, cases)
23. Comments regarding adequacy of reliability and norms for particular purpose

24. Comments of reviewers
25. General evaluation
26. References
can often be obtained from the catalog. They are for the most part
self-explanatory (see Table 15). It is suggested that you enter (6) one or two
words to indicate the general type, so that completed analysis forms may be
filed. The date entered (7) should be that of the most recent thorough
revision of the manual. (This is one of the places where publishers sometimes
introduce misleading information to make a test more appealing. It is
possible to copyright the test manual every year so that it looks up-to-date,
even though no change has been made. Such embellishments will not confuse the
reader unless he gives undue weight to superficial values. There are many such
half-truths or misleading claims in manuals. Some can be spotted by any
reader, while others are identifiable only by an expert. If the reader finds
that the manual or test advertising is untrustworthy in one respect, he must
of course view all the remaining information with suspicion. The scientific
and ethical quality of the manual, however, is not always a sign of the
quality of the test; a few excellent tests have manuals which are open to
severe criticism for their claims.)
The first step is to form an impression of the test by examining the items
and the aims the author had in mind in preparing the test. [...] which
largely determines the appeal of the test to [...] other laymen. In the form
(11), one can describe the test and should also list the titles of subtests
to be separately scored. Attention should be given to the objectivity of
scoring.

The author's stated intentions (12) help one to understand the nature of the
test. One should be hesitant to use it for a quite different purpose,
although this is sometimes defensible. The manual will usually indicate
whether the author was interested in selection, guidance, clinical use, or
classroom use, and will often tell what aptitudes or traits he had in mind in
selecting items. The source of items is particularly important if the test is
to be interpreted on the basis of content validity.

Most manuals report statistical studies used in selecting items. The most
common procedure is to correlate the item score with the total score,
discarding items which do not seem to measure the same thing as the rest.
Though this procedure is likely to improve a test, particularly since it
eliminates ambiguous items and makes the items more similar to each other, it
does not necessarily improve validity. Indeed, narrowing the range of content
(in a mechanical aptitude test, for example) can lower validity by covering
the field less thoroughly. For this reason, item-test correlations should
never be referred to as "item-validity coefficients." The consumer usually
cannot evaluate the technical procedures used in test construction.
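
The item-selection procedure described above (correlating each item score
with the total score) can be sketched in a few lines. This is only an
illustration under stated assumptions: the response matrix is invented, items
are scored 1 (right) or 0 (wrong), and the function names are hypothetical
rather than taken from any test manual.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # no variation, so the correlation is undefined; report 0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)

def item_total_correlations(responses):
    """Correlate each item (column) with the total score (row sum)."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals)
            for j in range(n_items)]

# Invented data: six examinees (rows) by four items (columns).
RESPONSES = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

if __name__ == "__main__":
    for number, r in enumerate(item_total_correlations(RESPONSES), start=1):
        verdict = "retain" if r > 0 else "candidate for discard"
        print(f"item {number}: r = {r:+.2f} ({verdict})")
```

In this invented sample the fourth item runs against the total and would be
dropped. As the text warns, such a coefficient shows only that an item agrees
with the other items; it says nothing about whether the test predicts an
outside criterion.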


Directions (13) can be examined with regard to their clarity and the extent
to which they standardize the test.

An armchair analysis of the test items should be made (14) to judge what
abilities, experiences, work habits, or personality traits influence the
score. Such an analysis is required for each of the subscores which is to be
interpreted. Armchair analysis cannot hope to identify all the contributing
variables, but the analysis raises questions to be used in interpreting
validity studies and helps in interpreting the test. The report should state
what the score seems to indicate, whether or not this is what the author
intended. It should also list irrelevant variables which are likely to
distort scores. As a part of this analysis, it is desirable to take the test
oneself or to administer it to another person and observe his performance.

Empirical evidence of validity (16-19) may be of various sorts. Spaces are
provided for predictive and concurrent validation, and also for other studies
bearing on construct validity. Some of these spaces may be irrelevant for a
particular test or a particular purpose and if so would not be filled in.
While most users examine only the evidence presented in the test manual, one
can evaluate the test better if he considers data published elsewhere. For
some tests, the volume of research is so great that it can only be summarized
or sampled. Under heading 18 any study might be listed that helps to
establish what the score measures. One might list, for a mechanical aptitude
test, evidence of its overlap with general intelligence, of its degree of
speeding, or of the extent to which physics students earn better scores than
those who have not studied physics. Factor analyses are often relevant to
this question. Here specifically, it is necessary to select the most
important information from that available.


TABLE 15. Evaluation Form for the TMC

 1. Title. Test of Mechanical Comprehension
 2. Author. George K. Bennett (with Diana Fry, Wm. A. Owens in some forms)
 3. Publisher. Psych. Corp.
 4. Forms and groups to which applicable.
      AA: high-school boys, job applicants
      AA-F: French language form of AA; questions and directions in both
        languages
      BB: experienced workers, advanced students
      CC: engineering students, high-level job applicants
      W-1: high-school girls, female job applicants
 5. Practical features. Can be machine-scored.
 6. General type. Aptitude
 7. Date of publication. 1940, 1947 (AA); 1941, 1951 (BB); 1947 (W-1);
    1949 (CC).
 8. Cost, booklet, 18¢; answer sheet, 4¢
 9. Time required. 30 min.
10. Purpose for which evaluated. Vocational guidance of high-school pupils.
11. Description of test, items, scoring. Pictures of simple apparatus.
    Questions in 3-choice form (5-choice in CC) as to what will happen to an
    object when force is applied, which of two structures is most stable,
    etc. Objective scoring (R - 1/2 W; R in CC). Only one score obtained.
12. Author's purpose and basis for selecting items. Intended to measure an
    ability required in many jobs and training courses. Past experience is
    allowed to affect scores, but the items require understanding rather
    than rote knowledge. Items were put through various stages of criticism
    and tryout; items retained were those discriminating high scorers (on
    the total test or a pooled Mech. Comp. score) from low scorers.
13. Adequacy of directions; training required to administer. Directions are
    unusually clear and simple. Classroom teacher can handle. Answer sheet
    for all forms save [...] involves awkward matching of arrows to line up
    with booklet.
14. Mental functions or traits represented in each score. General experience
    with machines common in Western world, understanding of simple
    principles of motion, energy [...]. Formal physics helpful but not
    required. Solutions can be intuitive or deductive; more rigorous
    deduction required in CC. Unspeeded.
15. Comments regarding design of test. Highly efficient. Use of correction
    formula in scoring is unnecessary but harmless. Note that no claim is
    made that the test measures an innate aptitude.
16. Predictive validation (criterion, number and type of cases, result).
    Manual gives references to numerous military and industrial studies
    where TMC was correlated with [...] technical training or job ratings.
    Coefficients range from .30 to .60. Evidently generally useful, though
    the test usually must be supplemented by general mental or verbal
    measures. Information is lacking on usefulness of the test for
    prediction in high-school courses, or on long-range predictions from
    high-school testing. [...] are available for DAT version. Form CC has
    validities .28 to .50 for college freshmen, with performance in
    technical courses as criteria. Note that range is restricted, compared
    to [...].
17. Concurrent validation (criterion, number and type of cases, result).
    Some of the preceding [...] as to time separating test and criterion.
18. Other empirical evidence indicating what the test measures. Study of
    1471 applicants for fireman-policeman jobs shows that high-school
    physics raises scores [...] s.d. on AA. (This information is needed,
    and not now available, for CC, where effect of physics or math might be
    greater.) Form AA correlates about .50 with general mental tests in
    wide-range groups; [...] CC correlate .20-.30 among applicants to
    engineering school. Considerable overlap with spatial tests (.50 with
    Minn. Paper Form Board in wide-range group, [...] with College Board
    [...] test among engineering-school applicants). Correlation of .30 with
    tool [...]. Factor-analytic studies* show a mechanical experience
    factor prominent in [...]; there are also substantial loadings with
    general mental ability, and spatial or visualization ability.
19. Comments regarding validity for particular purposes. Test has predictive
    value for jobs or courses involving nonroutine machine operation.
    Overlaps general and spatial tests, so that its independent value would
    depend on the situation. Lack of data on predictive power in high
    school limits interpretability.
20. Equivalence of forms or internal consistency (procedure, cases, result).
    Split-half method, 500 ninth-grade boys, r = .84. Similar coefficients
    for other forms, lower in groups of restricted range. Interform
    correlation about .80 for BB vs. CC; no others reported.
21. Stability over time (procedure, time interval, cases, result). No
    information presented.
22. Norms (type of norms, cases). Each manual offers several columns of
    percentile norms for various groups in schools and industry. Selection
    of groups is poorly described (e.g., "833 [...] boys," "417 applicants
    for unskilled jobs"). For CC, tables are given separately for two
    specific engineering schools.
23. Comments regarding adequacy of reliability and norms for particular
    purpose. Reliability of .80 is satisfactory but low. A second test
    should be given if a few points' difference in score would alter a
    decision. The norms presented have limited usefulness; the counselor
    would have to obtain norms for his school, for special courses in the
    school, and if possible for the local job market. Absence of information
    on stability prevents confident use of the test for long-range
    predictions earlier than twelfth grade.
24. Comments of reviewers.
    "[...] of directions are models of conciseness and honesty. . . . There
    is little doubt that the test measures comprehension of many mechanical
    principles, but its value for prediction has been questioned on the
    ground that several items involve principles or [...] one is unlikely to
    encounter in everyday mechanical experience, outside of a [...] course"
    (Charles M. Harsh, in Buros, 1949, p. 720).
    "The Test of Mechanical Comprehension . . . should prove to be a useful
    tool especially to those persons engaged in educational and vocational
    guidance. It should also [...] usage in the technical school and the
    industrial employment office. It is an [...] test; the items are
    intrinsically [...]; all the forms appear to have been well constructed;
    and they are [...] usefulness of the test will [...] increase as more
    validity data are made available" (George A. Satter, in Buros, 1949,
    p. 723). (Note: Considerably more evidence of validity has appeared
    since the date of this comment.)
25. General evaluation. An exceptionally popular test, various versions
    having been used in prediction batteries and having repeatedly shown
    value against mechanical criteria. The concreteness of the test makes it
    appealing to subjects and laymen; when used in guidance, it dramatizes
    the concept of special abilities. [...] could be used in ninth grade to
    explore aptitudes, and CC could be used with seniors considering
    engineering or technical courses. The DAT Mechanical Reasoning Test is a
    version of the Bennett TMC which should be preferred in high-school
    guidance, for several reasons: superior norms, more substantial validity
    information against high-school criteria, comparability to other tests
    of the DAT battery, more information on [...]. Compared to other
    mechanical aptitude tests, the Bennett and DAT are less dependent on
    either shop experience or dexterity. The test is a measure of
    understanding and intellectual trainability; it does not guarantee
    proficiency without training, nor skill in manual performance.

* The meaning of information of this sort will be considered in Chapters 9
and 10.

The final evaluation of validity (19) is the most important single entry in
the form. Does the test give the information needed to make the intended
decision? What degree of confidence can be placed in it? What level of
psychological training is required to interpret the test as proposed? To
reach such an integrated conclusion, it is necessary to weigh positive and
negative evidence, to decide which of several contradictory findings is most
trustworthy, and to judge the adequacy of the body of evidence as a whole. It
is especially important to note what necessary evidence on validity is
lacking.

The next major section (20-23) considers reliability and norms. This
information is usually presented in the manual and needs only to be
summarized. The most common faults in reliability information are failure to
report subtest reliability, and application of internal-consistency formulas
to speeded tests. Norms must be examined critically for representativeness
and for relevance to the user's own situation.
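
When summarizing reliability entries, two standard computations are often
useful: the Spearman-Brown formula, which steps up a correlation between two
half-tests to an estimate for the full-length test, and the standard error of
measurement, which shows how much a score may shift from chance alone. The
sketch below uses hypothetical figures throughout; the standard deviation of
10 points is an assumption for illustration, not a value from any manual.

```python
import math

def spearman_brown(r_part, length_factor=2.0):
    """Estimate reliability of a test lengthened by `length_factor`.

    With length_factor=2 this converts a half-test correlation into the
    estimated reliability of the whole test.
    """
    return length_factor * r_part / (1.0 + (length_factor - 1.0) * r_part)

def standard_error_of_measurement(sd, reliability):
    """Standard deviation of errors of measurement, in score units."""
    return sd * math.sqrt(1.0 - reliability)

if __name__ == "__main__":
    half_test_r = 0.70   # hypothetical correlation between two half-tests
    assumed_sd = 10.0    # hypothetical standard deviation of raw scores
    full_r = spearman_brown(half_test_r)
    sem = standard_error_of_measurement(assumed_sd, full_r)
    print(f"estimated full-test reliability: {full_r:.2f}")
    print(f"standard error of measurement:   {sem:.1f} points")
```

A standard error of several points makes concrete the advice that a second
test should be given when a few points' difference would alter a decision.
Note that a split-half estimate of this kind is among the internal-consistency
procedures that should not be applied to speeded tests.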

It is of course important to examine whatever critical reviews are available,
and the record form includes a space (24) for quotations which summarize the
reviewer's evaluation. The general evaluation (25) is a final summary of the
advantages and limitations of the test for the particular purpose,
considering both its technical and its practical features. It is appropriate
to compare the test with others having the same general function. One can
also point to supplementary information which should be combined with the
test. Special ways of applying the test, over and above its use as a
measuring instrument, should be noted. These would include making
supplementary observations during the test, examining responses to obtain
cues for diagnosis, using test responses as a point of departure in a
counseling interview, etc.

An analysis of this sort for every test under consideration (whether the
analysis is put in writing or not) provides a basis for a total testing
program. A program is more than a list of good tests. A program will be
designed to minimize wasteful overlap and timed so as to get each piece of
information when it will be most helpful. Testing cannot be planned by
itself. In industry or the armed forces it must be dovetailed with
recruiting, training, and assignment. In the clinic, testing must be
considered as part of the whole therapeutic effort. The final program states
what tests will be given and when, and how the results will be used in
assigning the person or in helping him to understand himself.


Suggested Readings 


Anastasi, Anne. Test reliability. Psychological testing. New York: Macmillan,
1954. Pp. 94-119.
This textbook chapter covers essentially the same principles and techniques
about reliability as the present chapter includes.

Rothney, John W. M., & others. Test scores: etiology and interpretation.
Measurement for guidance. New York: Harper, 1959. Pp. 116-150.
How to read a test manual and test advertising critically is discussed, with
numerous examples. These authors maintain a severely critical attitude toward
tests, demanding a closer approach to perfection than does the present text.

Wesman, Alexander G. Reliability and confidence. Test Serv. Bull., 1952,
No. 44. (Available on request from Psychological Corporation. Also reprinted
in H. H. Remmers & others (eds.), Growth, teaching, and learning. New York:
Harper, 1957. Pp. 449-457.)
In a simple presentation Wesman covers the major difficulties in the
interpretation of reliability coefficients reported in test manuals.


PART TWO


TESTS OF ABILITY


Measurement of General Ability:
The Binet and Wechsler Scales


THE EMERGENCE OF MENTAL TESTING


Tests Before Binet

The outstanding success of scientific measurement of individual differences
has been that of the general mental test. Despite the overenthusiasm and
occasional errors that have attended its development, the general mental test
stands today as the most important single contribution of psychology to the
practical guidance of human affairs. Among mental tests, none has been more
influential than that fathered by Alfred Binet. A history of mental testing
is in large part a history of the Binet test and its descendants.

The first systematic experimentation on individual differences in behavior
arose from the accidental discovery of differences in reaction time among
astronomers. In 1796, an assistant named Kinnebrook at Greenwich Observatory
was engaged in recording, with great precision, the instant when certain
stars crossed the field of the telescope. When Kinnebrook's results were
found to be consistently eight-tenths of a second later than those of his
superior, the Astronomer Royal, he was thought incompetent in his work and
was discharged. Not until twenty years later did more careful studies show
that the differences between observers were the result of the differing
speeds with which they could respond to stimuli. Only gradually did such
differences come to be recognized as significant facts about human nature,
rather than as annoying errors contaminating scientific work.

Physiologists, biologists, and anthropologists were stimulated by the
scientific climate of the nineteenth century to make a great variety of
measurements of human characteristics. Notable among these early workers was
Sir Francis Galton, whose interest in differences among individuals developed
from Darwin's newly published theory of differences among species. During the
latter half of the nineteenth century, Galton invented ways of measuring
physical characteristics, keenness of the senses, and mental imagery. These
methods, though not developed fully by Galton, served as models for later
tests. In addition, Galton demonstrated that outstanding intellectual
achievement tended to occur frequently in certain families. Genius,
evidently, was not an accident or a gift of the gods, but a natural
phenomenon to be investigated scientifically.

At this time, psychology was only beginning to emerge as an objective
science. Mental processes, it was suggested, could be observed under standard
conditions by an experimenter. Scientific observations, supplementing or even
replacing philosophical speculation, could provide an exact description of
the relation between the mental and physical worlds. This was the aim with
which Wundt opened the first psychological laboratory in Leipzig, and he and
his colleagues did triumphantly establish quantitative psychological laws
comparable in form to those of physics. Believing that psychological research
should analyze behavior into its simplest elements, he designed techniques
for measuring very limited functions. Wundt, trying to establish the general
laws governing all minds, was not concerned with individual differences. His
laboratory procedures and particularly his interest in quantitative research,
however, had a strong influence on early tests. In the United States as early
as 1890, J. McKeen Cattell was using a mixture of procedures from Wundt's and
Galton's laboratories to measure sensory acuity, strength of grip,
sensitivity to pain from pressure on the forehead, and memory for dictated
consonants. Cattell was first interested in the range of individual
differences as a laboratory problem, but he quickly became excited about the
practical value of identifying superior individuals by means of these
procedures.

This line of effort unfortunately met an early debacle when it was discovered
that the new tests measuring simple elements of behavior seemed to have no
relation to significant practical affairs. The crucial study was Wissler's
work on test scores of Columbia students (Wissler, 1901). He correlated
college marks with the Cattell tests, finding such negligible correlations as
the following: reaction time, -.02; canceling a's rapidly on a printed page,
-.09; naming colors, .08; auditory memory (recall of digits), [...]. We now
recognize that low correlations were certain to result, no matter what mental
functions were tested, because Wissler's brief tests were quite unreliable,
especially in his highly selected group. The disappointment which followed
the Wissler study, however, delayed attempts to base an applied psychology on
the findings of the laboratory.
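
The claim that low correlations "were certain to result" can be illustrated
with two standard psychometric formulas: Spearman's attenuation formula for
unreliable measures, and the usual formula for the effect of selecting a
group with a narrowed spread on the predictor. The sketch below is a rough
illustration only. Every number in it is hypothetical, none comes from
Wissler's data, and the second formula assumes direct selection on the test
with a linear relation to the criterion.

```python
import math

def attenuated_r(true_r, reliability_x, reliability_y):
    """Correlation expected between two fallible measures (Spearman)."""
    return true_r * math.sqrt(reliability_x * reliability_y)

def range_restricted_r(r, sd_ratio):
    """Correlation expected after selection on the predictor.

    sd_ratio is the predictor's standard deviation in the selected group
    divided by its standard deviation in the unselected group.
    """
    return (r * sd_ratio) / math.sqrt(1.0 - r * r + r * r * sd_ratio ** 2)

if __name__ == "__main__":
    true_r = 0.50        # hypothetical relation between error-free measures
    test_rel = 0.50      # hypothetical reliability of a brief test
    marks_rel = 0.60     # hypothetical reliability of college marks
    sd_ratio = 0.50      # hypothetical: selected group half as variable

    observed = attenuated_r(true_r, test_rel, marks_rel)
    selected = range_restricted_r(observed, sd_ratio)
    print(f"after unreliability:    r = {observed:.2f}")
    print(f"after selection as well: r = {selected:.2f}")
```

Under these assumed values a substantial underlying relation shrinks to a
coefficient that would look negligible in a table, which is the point of the
paragraph above.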

Wundt tested elements which could be precisely defined, using stimuli which
could be accurately controlled in the laboratory. The tests have validity in
the same way that a chemist's measure of the freezing point of a substance
has validity; the result describes a clearly defined characteristic and is
readily interpreted at a superficial level, no matter how much remains to be
learned about the underlying process. Tests of this sort have an obvious
content validity, and continued investigation in the laboratory spins an ever
stronger web of theory between these measures and important constructs. Their
validity for predicting practical criteria, however, has usually been
negligible (except that color vision and other sensory qualities are
important in some tasks).

For practical prediction, psychologists have relied on tests constructed on
quite another principle. Whereas laboratory tests have mostly dealt with
narrowly defined functions, most practical tests are complex worksamples.
When a complex performance is to be predicted, a sample of that very
performance will often prove to be a good predictor. To minimize effects of
specific training and to obtain a test of wide applicability, the test may
sample, not the criterion task exactly, but the general type of reasoning or
motor performance required by the criterion. The Bennett TMC is of this
nature. The Block Design test is not a sample of a real task but is an
artificial task requiring complex reasoning similar to life problems without
depending on special knowledge.
Practical testing came into psychology from medicine. Clinicians dealing with
mental defectives and pathological cases needed diagnostic tests.
Psychiatrists looked for tests which would distinguish normal from abnormal
subjects, and distinguish among various types of mental disorders. Kraepelin
and other nineteenth-century psychiatrists used reasoning problems and tests
of performance in continuous work. These tests were comparable to
requirements of life outside the laboratory. Though few of the tests of this
period survive in present-day diagnosis, clinical tests still are chiefly
concerned with complex processes. Alfred Binet, to whom we turn in a moment,
was a physician by training and he chose tests which could distinguish
between clinical groups, no matter how obscure or complex the psychological
"meaning" of the tests.


The Binet tests did have practical value for the physician, the educator, the
social worker, and, in modified form, for the employer. The practical tests
of today are much closer to worksamples of life performance than to the
psychophysical measures of Wundt. These complex tests will surely never be
replaced, but neither have they shown much recent development. Ability tests
have remained about the same since 1920, and [...] tests since 1930. The
practical tests of today differ from the tests of 1920 as today's automobiles
differ from those of the same period: more efficient and more elegant, but
resting on the same principles as before.

Tests measuring relatively limited types of performance have undergone more
radical changes. Factor analysis of ability tests is leading to a conception
of abilities and their relations going far beyond that of 1920, and numerous
tests have been prepared to measure elementary performances. The original
suspicion that simple laboratory measures might have important relations to
personality is now being substantiated; for example, whether a mental patient
perceives an intermittent light as steady or flickering is related to his
diagnosis. The simpler tests are (with rare exceptions) too inefficient for
practical use, and in that sense they stand where Wissler's experiment left
them. But these tests are better rooted in psychological theory than the
complex tests which are most useful at the moment, and they should ultimately
have practical value.


The Binet Tests 


Alfred Binet, a French physician, became interested in studying judgment,
attention, and reasoning about 1890. His interest in these complex mental
processes led him to try a greater variety of tests than his predecessors had
used. In studies published between 1893 and 1911, he tried to find out just
how "bright" and "dull" children differed. Having little preconception
regarding this difference, he tried all sorts of measures: recall of digits,
suggestibility, size of cranium, moral judgment, tactile discrimination,
mental addition, graphology, even palmistry! He found, as did other
investigators, that the tests of sensory judgment and other simple functions
had little relation to general mental functioning, and he gradually
identified the essence of intelligence as "the tendency to take and maintain
a definite direction; the capacity to make adaptations for the purpose of
attaining a desired end; and the power of auto-criticism" (Terman, 1916,
p. 45).

The stage was set, then, for the call in 1904 to produce the first practical
mental test. Paris school officials became concerned about their many
nonlearners and decided to remove the hopelessly feeble-minded to schools
where they could be taught a simplified curriculum. The officials could not
trust teachers to pick out the feeble-minded. They did not want to segregate
the child of good potentiality who was making no effort and the
trouble-making child the teacher wished to be rid of. Moreover, they wanted
to identify all the dull from good families whom teachers might hesitate to
rate low, and the dull with pleasant personalities who would be favored by
the teacher. Therefore they asked Binet to assist in producing a method for
distinguishing the genuinely dull. Binet's scale, which drew on his earlier
studies, was published in collaboration with Simon in 1905. In 1908 a
revision was published, and in 1911 another.

There was a great demand at this time, especially in America, for objective
methods of investigating psychological development. Although Thorndike was
using experimental tests on animals, American psychological research had been
dominated by introspection, anecdotes, and questionnaires, all of which were
as fallible as the person reporting. Binet's method, which was to a large
degree impartial and independent of the preconceptions of the tester, was
welcomed enthusiastically as a research technique and as a way of studying
subnormal children.
Lewis M. Terman began experimentation with the Binet tests. He published the
Stanford Revision of the Binet Scale in 1916. This revision extended the
application of Binet's method to normal and superior children. The Stanford
Revision had immediate popularity and became, rightly or wrongly, the
standard by which other tests were judged. Although there had been other
mental tests, the outstanding popularity of the Stanford Revision made its
conception of mental ability the standard. The acceptance of the Stanford
test was due to the care with which it had been prepared, its success in
testing complex mental activities, the easily understood "IQ" it provided,
and the important practical results which it quickly produced. Although many
criticisms have been made of the test, it was and is an extremely useful
instrument.

The 1916 Stanford-Binet was replaced in 1937 when Terman and Merrill
published Forms L and M of the Stanford-Binet. These tests improved on the
former edition and offered two comparable forms. The most recent revision
(1960) combines the best tests of the 1937 revision into a single Form L-M
and improves and updates the scoring system. In all parts of the world there
have been other versions taken directly from the Binet test or one of the
Terman revisions.


More Recent Trends

Evolution of general mental tests since 1911 has taken two directions. On one
hand, individual tests have been increasingly designed to allow observation
as a supplement to the accurate overall score. While Binet's items reveal
considerable diagnostic information, they were not chosen for this purpose.
We have already mentioned that much can be learned about the child's
personality by watching him solve mazes, and the Binet scale includes a few
maze items. Porteus, however, capitalizes on the special value of the maze by
providing a whole series of mazes of graduated difficulty. The Kohs blocks
have a similar advantage. The highest development of tests for observation
and diagnosis is found in the popular Wechsler scales.

Whereas this trend led to more elaborate mental tests and gave great
responsibility to the observer, the other line of evolution was toward
simpler, more mechanical tests. Procedures which could be applied to large
numbers of people at once and scored routinely were first demanded for
military purposes. Several psychologists had devised experimental group tests
prior to 1916. When it became necessary in World War I to expand the Army at
an explosive rate, the Army requested psychologists to prepare a group test
so that inductees who were promising could be given officer training, those
who were unfit could be rejected, and the remainder could be appropriately
classified. In one of the major achievements of practical psychology, a group
including Terman, Yerkes, and Bingham assembled a test whose final version
became famous as Army Alpha. Alpha tested ability to follow directions,
simple reasoning, arithmetic, and information. It was a practical test,
easily administered and highly useful to the Army, as Figure 28 suggests. It
convinced the nation that adequate prediction of success could be achieved
through mass processing, and schools and industries were quick to demand
tests of this type after the war. Alpha in a revision and comparable group
tests by Otis and others were extensively applied.

FIG. 28. Alpha scores of Army personnel of various ranks (Yoakum and Yerkes,
1920).

Since 1920 there have been changes in test design. For example, whereas the
early tests were highly speeded, time limits are generous in recent American
tests. The content of today's general mental tests is not, however, greatly
different from that of Army Alpha. They are more efficient and have better
norms, but they are not different in kind. The introduction of specialized
tests such as the TMC has been the most important innovation in group testing
since the 1920s. Recent research has increased the variety of abilities the
psychologist finds it important to measure. Though specialized tests are
being used more and more in guidance, clinical work, and personnel and
industrial selection, they are nearly always supplementary to general mental
tests derived from Binet's work.

Do not assume that other lines of approach before and after Binet had no
merit, merely because they failed to attain comparable prominence. Early
workers explored many leads which appear to have been unduly neglected
(Peterson, 1925). Binet himself (following still earlier workers) made use of
inkblots to study imaginative and perceptual processes, but this technique
fell into obscurity from which it emerged only because Rorschach
independently revived the procedure twenty years later. In his monograph The
Experimental Study of Intelligence, Binet described the application of
inkblot and imagery tests to his daughters, arriving at qualitative
descriptions of the way their intelligence functioned which read as if taken
from the most modern results of projective techniques. The possibilities of
improved impressionistic procedures, which psychologists are today examining,
were neglected while Binet's psychometric strategy of summarizing all
intelligence in a single score was adopted. The accidents of time and place
play a large part in psychological history; there was, in 1905, a great
practical need for a simple and objective way of summarizing a child's
general level of mental development, but no popular demand for analysis of
individual patterns of thought.


CHARACTERISTICS OF THE STANFORD-BINET SCALE 


In the Stanford-Binet (SB) scale, as in every test to be studied, one can
trace how the investigators solved four problems which face the test
designer. First, he must decide what he intends to measure. Second, he must
invent or select items which serve that purpose. Third, he must find a
measuring unit in which to express results, since behavior rarely can be
described in countable units like inches, pounds, or light-years. Fourth, he
must show the validity of the test. Knowing how these were solved for the SB
not only shows wherein it made a contribution but also throws light on its
limitations.


Assumptions About Intelligence 


The person making the first mental test is in the position of the hunter
going into the woods to find an animal no one has ever seen. Everyone is sure
the beast exists, for he has been raiding the poultry coops, but no one can
describe him. Since the forest contains many animals, the hunter is going to
find a variety of tracks. The only way he can decide which one to follow is
by using some preconception, however vague, about the nature of his quarry.
If he seeks a large flat-footed creature he is more likely to bring back that
sort of carcass. If he goes in convinced that the damage was done by a pack
of small rodents, his bag will probably consist of whatever unlucky rodents
show their heads.

Binet was in just this position. He knew there must be something like
intelligence, since its everyday effects could be seen, but he could not describe
what he wished to measure, as it had never been isolated. Some workers,
then and now, have objected to this circular and tentative approach
whereby mental ability can be defined only after the test has been made.
Tests are much easier to interpret if the items conform perfectly to a
definition laid down in advance. When faculty psychology was in vogue, many
separate tests were designed for the separate mental faculties: reasoning,
memory, attention, sensory discrimination, and so on. None of these tests
used singly, however, was found to have predictive value. Terman (1916,
p. 151) explained this as follows:


The assumption that it is easier to measure a part, or one aspect, of
intelligence than all of it, is fallacious in that the parts are not separate
parts and can not be separated by any refinement of experiment. They
are interwoven and intertwined. . . . Memory, for example, cannot be
tested separately from attention, or sense discrimination separately
from the associative processes. After vainly trying to disentangle the
various intellective functions Binet decided to test their combined
functional capacity without any pretense of measuring the exact
contribution of each to the total product.


Modern diagnostic tests do obtain useful information about distinct
aspects of ability. In Binet's time, though, one of his great contributions was
to replace the idea of separate functions with the concept of general
intelligence. Having started with the idea that some children were bright and
some dull, he found quickly that those who were best on tests of judgment
were also superior in attention, memory, vocabulary, etc. In other words, the
tests were correlated. The correlation shows that there must be some
underlying unity among these mental tests. When psychologists refer to general
mental ability, they refer to the characteristic that accounts for the
correlation among mental tests.
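
The statement that two tests are "correlated" can be made concrete with the
product-moment correlation coefficient. The following is only a minimal sketch
in Python: the function name pearson_r and the two score lists are
hypothetical, invented for illustration, and are not data from Binet or
Terman.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from the two means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores of five children on two tests (illustrative only).
judgment = [3, 5, 6, 8, 9]
memory = [2, 4, 7, 7, 10]

r = pearson_r(judgment, memory)
print(round(r, 2))
```

A coefficient near 1.0 means that children who rank high on one test tend to
rank high on the other, which is the sense in which Binet's tests showed an
underlying unity.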

Binet refined his idea of intelligence by trial and error. If color matching
does not correlate with other estimates of mental ability, it must not be
influenced by the common factor. If knowing certain information correlates
with the tests of reasoning, both must measure intelligence. Out of a . . .

. . . general mental ability or test of general scholastic ability.
"Intelligence" connotes some sort of inborn mental superiority. Performance on
the test is influenced by many things not included in this concept of
"intelligence." The test calls for knowledge, skills, and attitudes developed
in Western culture, and perhaps better developed in some environments than in
others. An "intelligent" person will do badly if he lacks the background the
test requires. A person is born with potentialities which may or may not be
developed. The Binet scale gives only very indirect evidence on
"potentialities"; one can observe potentiality only when it has been developed
into performance.

1. What do the following definitions of "intelligence" include that Binet's
definition does not, and vice versa?
a. "The ability to do abstract thinking" (Terman).
b. "The power of good responses from the point of view of truth or fact"
(E. L. Thorndike).
c. "The property of so recombining our behavior patterns as to act better in
novel situations" (Wells).

2. Would the same sort of test items be called for by each of these
definitions?

3. Is previous learning included in intelligence by these definitions? By
Binet's?


Selection of Items 


To test high-jumping ability, we would ask a boy to jump over standards
of various heights, beginning with easy ones and increasing the height until
we found the highest level at which he could succeed. The experimental
psychologist uses the same device in measuring weight discrimination. The test
begins with pairs of weights which are easily discriminated, and the
difference within pairs is gradually reduced until the person can no longer
tell which is heavier. The Binet scale sets up similar "hurdles." It begins
with items the subject is expected to pass, but as the items become more
difficult, the subject begins to fail. The test is continued until we have
determined the most difficult mental hurdle he can get over.

Binet, studying bright and dull school children, realized that mental ability
increases with age. The older child is superior in taking directions, making
adaptations, and judging his own ideas. It follows, then, that a good
mental-test item should be easier for older children than for younger ones. An
item should not be used if just 25 percent of children of every age can pass
it; such an item is difficult, but it does not reflect mental development. In
selecting items for the Binet test and its revisions, preference was given to
items on which success is markedly related to age. Binet further assumed
that it was important to measure a general quality running through all mental
tasks. Therefore, a good item should correlate with the rest of the scale.
Items are located in the scale according to their difficulty for children at
each age. A test which about 60 percent of 13-year-old children can pass is
placed at Year XIII.
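
The two selection rules just described (success must rise with age, and an
item is placed at the year where roughly 60 percent pass) can be sketched in a
few lines of Python. This is an illustration under stated assumptions, not the
procedure of the test authors: the function names, the 0.60 target taken as a
fixed constant, and both sets of pass rates are hypothetical.

```python
def rises_with_age(pass_rate_by_age):
    """True if the proportion passing increases at every successive age."""
    ages = sorted(pass_rate_by_age)
    return all(pass_rate_by_age[a] < pass_rate_by_age[b]
               for a, b in zip(ages, ages[1:]))

def placement_year(pass_rate_by_age, target=0.60):
    """Age whose pass rate lies closest to the target proportion."""
    return min(sorted(pass_rate_by_age),
               key=lambda age: abs(pass_rate_by_age[age] - target))

# Hypothetical pass rates (proportion of children passing) at ages 10-14.
flat_item = {10: 0.25, 11: 0.25, 12: 0.25, 13: 0.25}   # hard, but not age-related
good_item = {11: 0.30, 12: 0.45, 13: 0.61, 14: 0.78}   # success rises with age

print(rises_with_age(flat_item))   # False: rejected as a mental-test item
print(rises_with_age(good_item))   # True
print(placement_year(good_item))   # 13, i.e., Year XIII
```

The flat item is rejected even though it is difficult, because difficulty
alone does not show that the item reflects mental development.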


4. When a Japanese investigator prepares a counterpart of the SB for Japanese
children, would a direct translation of the scale be satisfactory?


Description of the Scale


A child is given the SB by an experienced examiner, who presents each
item in the precise manner called for by the directions. The examiner begins
by establishing rapport, aided by the high interest the "games" have for
the younger children and the challenge of the test situation for the older
child. The first items tried are those for a mental level below that expected
of the child; beginning with easy tasks builds confidence. First, the basal
age, the scale level at which the child passes all the tests, is located.
Tests for the higher levels are then given in order, usually six tests at each
level. Testing continues until the child fails all tests at some level.

TABLE 16. Representative Tasks from the Stanford-Binet Scale

                                                    Correlation
                                                    with Whole  Nature of  Nature of
Year     Task                                       Test        Stimulus   Response

II-6     Points to toy object "we drink out of"     .55         Verbal     Motor
         Shows doll's hair                          .63         Verbal     Motor
         Names chair, key                           .70         Object     Verbal
         Repeats "4-7"                              .63         Verbal     Verbal, memory
IV       Names gun, umbrella                        .79         Picture    Verbal
         Recalls name of object (dog) when
           covered by box                           .53         Object     Verbal, memory
         "Brother is a boy; sister is a . . . ."    .56         Verbal     Verbal
         Matches circles, squares                   .25         Picture    Motor
         "Why do we have houses?"                   .70         Verbal     Verbal
VI       Defines orange, envelope                   .67         Verbal     Verbal
         Gives examiner 9 blocks                    [?]         Verbal     Motor
         Maze                                       .69         Object     Motor
         "An inch is short; a mile is . . . ."      .67         Verbal     Verbal
IX       Examiner notches folded paper; child
           draws how it will look unfolded          .62         Object     Drawing
         Verbal absurdities                         .83         Verbal     Verbal
         Reproduces design from memory              .60         Picture    Drawing, memory
         Repeats "8-5-2-6" backward                 [?]         Verbal     Verbal, memory
         Figures change from a purchase             .62         Verbal     Calculation
XII      Defines skill, juggler                     .79         Verbal     Verbal
         Finds absurdity in picture                 .51         Picture    Verbal
         Defines constant, courage                  .85         Verbal     Verbal
         Completes "The streams are dry . . .
           there has been little rain."             [?]         Verbal     Verbal
Average  Defines regard, disproportionate           .86         Verbal     Verbal
Adult    Explains how to measure 2 pints of
           water with a 5-pint and a 3-pint can     .70         Verbal     Verbal
         Explains a proverb                         .73         Verbal     Verbal
         Compares laziness and idleness             .80         Verbal     Verbal

In Form L-M, tests cover levels of mental development from age 2 to
Superior Adult III. From ages 2 to 5, there are six tests (plus one alternate)
at each half-year of development. Above age 5, hurdles are spaced a year
apart; and above age 14, the levels have even wider spacing. No child takes
the entire set of tests. A 9-year-old would begin with tests at a year level
below his age and, if he passed those, would continue until he reached his
limit. Some 9-year-olds would be unable to go beyond the 11-year level, whereas
others would still be passing a few tests at the 14-year level. One hour, more
or less, is required for the testing procedure, although there is great
variation from child to child.

Administering the scale requires skill. The tester must exercise considerable
judgment in obtaining from each child as clear answers as possible, without
probing more than the standardized directions allow. Rapport is especially
a problem with younger children, who are not accustomed to tests or
to tasks calling for sustained attention.

The child's response to the test varies greatly with his motivation and
security in the testing situation. The insecure child's performance in a
strange task with a strange examiner may be far below his potential. McHugh
(1943) gave the SB to pupils entering kindergarten and then retested them
after two months. Their mental-age scores increased nearly six months during
this period and their IQs by 6 points, on the average. Tasks requiring oral
response showed twice as much change as tasks calling for manipulative
responses. McHugh suggests that shyness in a new situation accounts for most
of the difference between the first and second test.

The young child often refuses to try items he should be able to solve, as in
this set of examiner's notes on a boy in kindergarten (Mayer, 1935, p. 325):


Always smiled and gave a sort of laugh when refusing to respond,
but was none the less determined. Same in school, according to teacher.
Refused to do things, but always smilingly and pleasantly, but will not
yield. School doctor was unable to give him physical examination
"because he refuses to open his mouth or do anything asked." On Pictures
he said, "no," and politely but conclusively turned the page. "I won't tell
you" was his affable answer for Comprehension, Materials, Opposite
Analogies. "No, I don't want to" disposed of . . . pushed away Buttoning,
refused to attempt the Knot, didn't want to draw a triangle, but was
prevailed upon to try. Even with the privilege of delivering a note to the
teacher . . . "No," he said, "I'll make a . . ." He didn't want to fold a
square . . . In marked contrast was his alacrity in responding to Paper
Folding, Triangle: "I'm going to make it too," and took the paper before
the examiner could give it to him.


. . . an item he has refused if it is . . . After refusal . . . 3-year-olds
tested would raise their IQs by . . . points or . . . Merrill-Palmer scale
. . . procedure to take refusals into account, but the . . .

Only those persons should give the SB who have been trained in its use
and scoring. An adequate training program calls for a general course in test
theory, a practicum course during which the student tests at least 25
subjects, and further experience in clinical courses where he gives the test
to various kinds of subjects. Training may be considered complete when the
person has tested about 100 cases under supervision, beyond the 25 practice
subjects.

The scale includes a great variety of tasks, as can be seen from Table 16.
Verbal Absurdities (Form L, Year IX; Terman and Merrill, 1937) is a rather
typical test:*


Procedure: Read each statement and, after each one, ask "What is foolish about
that?" If the response is ambiguous, say, "Why is it (that) foolish?"
(a) Bill Jones' feet are so big that he has to pull his trousers on over his
head.
(b) A man called one day at the post office and asked if there was a letter
waiting for him. "What is your name?" asked the postmaster. "Why," said the
man, "you will find my name on the envelope."
(c) The fireman hurried to the burning house, got his fire hose ready, and
after smoking a cigar, put out the fire.
(d) In an old graveyard in Spain they have discovered a small skull which
they believe to be that of Christopher Columbus when he was about ten years
old.
(e) One day we saw several icebergs that had been entirely melted by the
warmth of the Gulf Stream.


The child passes this at Year IX and receives two months' credit toward his
mental-age score if three responses are satisfactory. Four correct allows
more months' credit and counts as a pass at Year XII.
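
The pass rule for this test can be written out as a small Python sketch. It
assumes only what the paragraph above states (three satisfactory responses
pass the test at Year IX, four pass it at Year XII as well); the function name
is hypothetical, and the month credits are deliberately left out because the
second credit value is not given here.

```python
def verbal_absurdities_levels(n_satisfactory):
    """Year levels at which Verbal Absurdities counts as passed.

    n_satisfactory: number of satisfactory responses to statements (a)-(e).
    """
    if not 0 <= n_satisfactory <= 5:
        raise ValueError("there are only five statements")
    levels = []
    if n_satisfactory >= 3:
        levels.append("IX")    # three satisfactory responses pass at Year IX
    if n_satisfactory >= 4:
        levels.append("XII")   # four also count as a pass at Year XII
    return levels

print(verbal_absurdities_levels(2))   # []
print(verbal_absurdities_levels(3))   # ['IX']
print(verbal_absurdities_levels(4))   # ['IX', 'XII']
```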

The various subtests call for verbal and nonverbal performances, simple
memory and complex reasoning, learned answers to familiar questions, and
solution of novel problems calling for ability to adapt. Tasks involving
objects and pictures are used at younger ages, with an increasing number of
verbal problems through the school ages, and more tests of abstract thinking
at the upper end of the scale.

Scoring is made as objective as possible by means of a scoring guide which
contains specimen acceptable and unacceptable answers. In an absurdities
item, the subject is expected to recognize clearly the central absurdity,
and not to bring in irrelevant matters.


5. Judge the following answers to the problem "Bill Jones' feet are so big
. . ." (from Pintner et al., 1944, p. 60) as right or wrong:
a. You can't put them on because his legs are joined together.
b. You can't put your trousers over your head because your legs are in the
way.
c. He's supposed to put the trousers over his feet.
d. A man couldn't put his pants over his head.


Scoring System 


Binet's plan of successive hurdles makes it possible to report mental
development in a simple and easily comprehended score called the mental age.
The subject's mental age is the chronological age at which the average child
does as well as the subject does. John, who is only 5, can earn a mental age of
8 if he does as well as the average 8-year-old. Scoring would be simple if
the child passed all tests to a certain level and failed all tests after that
level. Because the failures enter gradually, the mental age is determined by
adding credits (usually two months per test) for each test passed. Where test
levels are two years apart, and six tests compose each level, each test counts
four months; where levels are six months apart, each test counts one month.
The total credit in months is converted into a mental age in years and
months (written thus: 5-8 for 5 years, 8 months).


* Copyright 1937, 1960, Houghton Mifflin Co., and used by permission.

TABLE 17. Stanford-Binet Performance of Six Children

[Columns: Frank, Billy, Herbert, May, Bruce, Nancy. Rows: age of each child,
and the number of tests passed by the child at each year level, with the
credit in months for each test.]


Table 17 reports the performance of six children. The first one, Frank,
shows uniform performance; when he begins to fail, he fails nearly all tests
at that level. His basal level is VI, so he receives a base credit of 6 years
of mental age. Four tests passed at Year VII add 8 months' credit; one at VIII
adds 2 months. His mental age (MA) is 6 years, 10 months. Billy has greater
"scatter" of successes and failures. His MA is figured as follows:


Basal Age VI                              6 yrs.
VII      4 tests, 2 mo. each                       8 mo.
VIII     4 tests, 2 mo. each                       8 mo.
IX       2 tests, 2 mo. each                       4 mo.
X        3 tests, 2 mo. each                       6 mo.
                                          6 yrs., 26 mo.
Total                                     8 yrs., 2 mo. = MA
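
The credit arithmetic can be checked with a short Python sketch. It follows
the rule stated in the text (basal age plus a fixed number of months for each
test passed above the basal level); the function name and the way passes are
listed are assumptions made for illustration.

```python
def mental_age(basal_years, passes):
    """Return mental age as (years, months).

    basal_years: year level at which all tests were passed.
    passes: list of (tests_passed, months_credit_per_test), one entry for
            each level above the basal level.
    """
    total_months = basal_years * 12 + sum(n * credit for n, credit in passes)
    return divmod(total_months, 12)

# Frank: basal VI, four tests at VII and one at VIII, 2 months each.
frank = mental_age(6, [(4, 2), (1, 2)])
# Billy: basal VI, then 4, 4, 2, and 3 tests at VII through X, 2 months each.
billy = mental_age(6, [(4, 2), (4, 2), (2, 2), (3, 2)])

print(frank)   # (6, 10), written 6-10
print(billy)   # (8, 2), written 8-2
```

Where levels are spaced differently, only the months-per-test figure in each
pair changes (for example 4 where levels are two years apart, 1 where they are
six months apart).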


Mental age measures the child's performance; it is in effect the raw score
on the test. Obviously he is a bright child if his MA is greater than his life
age. Two children of the same MA have the same average level of ability,
but they may differ in pattern of development. Young superior children pass
different tests than older subnormal children (Magaret and Thompson,
1950).

The MA is an estimate of present performance and of promise in the
immediate future. In any classroom, young superior children are more nearly
equal in performance to average children than to bright children of greater
age. In making decisions within a group of varied age (e.g., in sectioning
classes) the mental age rather than the IQ gives the most relevant
information. In research also, if it is desired to equate groups, to separate
groups of unequal ability, or to correlate some other variable with mental
ability, the mental age should be used rather than the IQ. This principle is
often violated. The correlation of IQ with another variable is lower than that
of MA with the same variable in a group of mixed age.

Mental growth is slow after age 15 or 16. The mental-age units employed
for higher ages are not directly related to the average performance at those
ages and therefore should be considered only as another form of raw score.
The average 20-year-old has a mental age well below 20.


6. Compute the mental ages for the remaining four children in Table 17.

7. A 20-year-old passes the following tests: XIV, all; Average Adult, 7 tests
out of 8, credit 2 months each; Superior Adult I, 2 tests out of 6, credit 4
months each; Superior Adult II, 1 test out of 6, credit 5 months each. Find
his MA.


THE INTELLIGENCE QUOTIENT

The "intelligence quotient" in the 1960 Stanford-Binet and nearly every
other current test is nothing more than a standard score. Instead of the
common scale with a mean of 50 and a standard deviation of 10, the IQ
conversion fixes the mean at 100 and the standard deviation at 16. Since the
distribution is nearly normal, the IQ can be interpreted as an indication of
the child's position in the group. The mental-age score is converted to an
IQ by referring to tables in the Stanford-Binet manual. Tables are provided
for ages from 2 to 18. For adults, the 18-year-old norms can be used,
although, as will be seen later, the average mental-test score is not strictly
constant throughout maturity.

This IQ is not really a quotient at all, and if it were not for its long
tradition there would be considerable advantage in employing a standard-score
scale with mean 50, as in other ability tests. The IQ was originally
introduced as a ratio or quotient representing the child's rate of mental
development. Mental age was divided by actual age and multiplied by 100 to
remove the decimal fraction. For Frank, whose age is 6 years and 4 months
(6.33) and whose mental age is 6-10 (6.83), the ratio IQ is 108. Development
more rapid than the average is indicated by a quotient over 100.

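
The ratio IQ just described is simple enough to verify directly. The Python
sketch below works in months to avoid rounding the ages first; the function
name is hypothetical, and rounding the result to a whole number is an
assumption about how the quotient was reported.

```python
def ratio_iq(ma_years, ma_months, ca_years, ca_months):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    mental_age = ma_years * 12 + ma_months        # in months
    life_age = ca_years * 12 + ca_months          # in months
    return round(100 * mental_age / life_age)

# Frank: mental age 6-10, chronological age 6-4.
frank_iq = ratio_iq(6, 10, 6, 4)
print(frank_iq)   # 108
```

A child whose mental age equals his life age obtains exactly 100 on this
scale.
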
The calculation of ratio IQs fell into disrepute for several reasons. The
ratio quotient was originally thought of as representing a fixed rate of
development which could not only express relative brightness today but also
predict mental age at subsequent ages. Evidence to be discussed below
indicates that the child's rate of mental development is not fixed. Moreover,
the ratio depends upon technical characteristics of the scale as well as upon
his mental growth. In the 1937 SB, the standard deviation of IQs was below 16
at ages 5 and 6, and much larger at ages 3 and 12. These variations resulted,
not from changing rates of development, but from the distribution of item
difficulties at the various ages. A third objection was that the ratio IQ could
not be applied to persons beyond age 18, where mental-age units become
arbitrary. Special corrections were introduced to obtain IQs on Forms L and
M for older subjects. For the 1960 revision, the investigators calculated the
standard deviation of mental age for a representative sample of persons at
each age. Whatever MA fell one standard deviation above the mean for
the age was converted into an IQ of 116. A standard-score IQ formed in this
manner is often called a "deviation IQ."

In the interim while the 1960 revision is replacing the 1937 revision,
there will be some confusion because IQs on the two scales are not strictly
comparable. A 12-year-old who has a ratio IQ of 138 on the older scale
would have a deviation IQ of 132. These differences probably do not distort
greatly the mean IQs in typical groups or the correlations of IQ with other
variables. Research results from Forms L and M may be used in interpreting
Form L-M.
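
The conversion to a deviation IQ is an ordinary standard-score
transformation: the mental age is expressed as a distance from the mean for
the child's age in standard-deviation units, and that distance is rescaled to
a mean of 100 and a standard deviation of 16. The Python sketch below is an
illustration only. The function name is hypothetical, and the mean and
standard deviation of mental age used in the example (120 and 20 months) are
invented figures, not values from the Stanford-Binet manual, which supplies
its own tables.

```python
def deviation_iq(ma_months, mean_ma_months, sd_ma_months,
                 scale_mean=100, scale_sd=16):
    """Standard-score IQ from a mental age and the age group's mean and SD."""
    z = (ma_months - mean_ma_months) / sd_ma_months   # distance in SD units
    return round(scale_mean + scale_sd * z)

# Hypothetical age group: mean MA 120 months, SD 20 months.
print(deviation_iq(140, 120, 20))   # 116: one SD above the mean
print(deviation_iq(120, 120, 20))   # 100: at the mean
print(deviation_iq(90, 120, 20))    # 76: one and a half SDs below

# The same function gives the more common standard-score scale
# (mean 50, SD 10) if the scale constants are changed.
print(deviation_iq(140, 120, 20, scale_mean=50, scale_sd=10))   # 60
```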


8. Compute ratio IQs for the remaining children in Table 17.

9. If for age 15 the standard deviation of mental ages is 2 years and 10
months, find the deviation IQ corresponding to an MA of 16.


Distribution of IQs 


The distribution of IQs in the 1937 standardization sample is shown in
Table 18. The change to deviation IQs is not expected to alter this
distribution greatly. The other two columns give comparative data for
high-school and college samples.

While comparison of any person or group with the total national population
is of some value, practical decisions require us to estimate how a person
will fit into a more selected group. Even when the child enters school, his
companions are not representative of the total population, for some subnormal
children are institutionalized or cared for in the home. Community
and neighborhood differences also restrict the range of any class. Through
the grades there is slow but continuous elimination, especially where children
are permitted to leave school to work. The superior child is less likely
to leave school than the child who is frustrated in schoolwork. The end result
is a gradual rise in the average level. A study of a representative sample of
dropouts in five school systems, made in the 1940's, permits us to construct
Table 19. The very dull tend to drop out as soon as they reach age 16; since
they are usually retarded one or two grades, they leave before reaching the
upper grades. By the end of high school, almost no one with IQ below 85 is
still in school. A few of the superior pupils drop out because of lack of
interest, financial problems, and other difficulties. Because tests other than
the SB were used, these IQs are not precisely comparable to Binet IQs. The IQ
range in high school is obviously unlike the representative sample studied
by Terman and Merrill, and college groups are even more selected, as
Table 18 shows.


TABLE 18. Percentage Distribution of IQs

                 Standardizing         High-School       College
                 Sample                Graduates         Entrants
IQ               (N = 2,904)           (N = 21,597)      (N = 1,093)

140 and above     1.3 )
130-139           3.1 ) 12.6             9.7               31.7
120-129           8.2 )
110-119          18.1                   22.8               46.1
100-109          23.5                   29.9               18.1
90-99            23.0                   23.2                4.0
80-89            14.5 )
70-79             5.6 ) 22.7            14.3                0.1
Below 70          2.6 )

Note: The standardizing sample data are for ratio IQs on the 1937
Stanford-Binet (Terman and Merrill, 1959). High-school data are for group
tests as recorded in school files (Semens et al., 1956). College data are for
Wechsler-Bellevue and WAIS administered to freshmen at San Jose State College
(Plant, 1958).


TABLE 19. Educational Records of 2500 Seventh-Graders

                                          Intelligence Quotient
                                  Below
                                  85      85-94   95-104  105-114  115+

All cases in Grade 7              400     575     650     575      [?]
Dropouts in Grades 7 and 8         93      30      14       5        8
Remainder entering Grade 9        307     545     636     570      [?]
Dropouts in Grades 9 and 10       241     171     143      78      [?]
Remainder entering Grade 11        66     374     493     492      [?]
Dropouts in Grades 11 and 12       52      65      81      55      [?]
Remainder continuing to
  graduation                       14     309     412     437      [?]

Source: Dillon, 1949.

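
The selective effect of dropping out can be made explicit by computing, for
each IQ band, the percentage of seventh-graders who continue to graduation.
The Python sketch below uses the Table 19 counts for the four lower bands; the
Grade 7 totals are obtained by adding the Grade 7-8 dropouts to the number
entering Grade 9. The 115+ band is omitted because its counts are not fully
legible in this copy, and the variable names are hypothetical.

```python
bands = ["Below 85", "85-94", "95-104", "105-114"]
dropouts_7_8 = [93, 30, 14, 5]        # Dropouts in Grades 7 and 8
entering_9 = [307, 545, 636, 570]     # Remainder entering Grade 9
graduating = [14, 309, 412, 437]      # Remainder continuing to graduation

# Everyone in Grade 7 either dropped out in Grades 7-8 or entered Grade 9.
grade_7 = [d + e for d, e in zip(dropouts_7_8, entering_9)]

percent_graduating = [round(100 * g / n, 1)
                      for g, n in zip(graduating, grade_7)]

for band, pct in zip(bands, percent_graduating):
    print(f"{band:>8}: {pct} percent continue to graduation")
```

On these counts the proportion continuing to graduation climbs from 3.5
percent in the lowest band to 76.0 percent in the 105-114 band, which is the
gradual rise in average level that the text describes.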
Since the range of abilities varies from school to school and from class to
class, a final judgment of the pupil's standing must be based on local norms.
Local norms change from time to time owing to population migration and
changing school policies. At the college level, Wolfle's report (1954)
provides an important warning against overgeneralizing in interpreting an
IQ. He studied 41 representative colleges with the AGCT test, whose scale
is roughly comparable to an IQ scale. In the highest of the 41 colleges, the
middle 50 percent of the entering freshmen fell between 126 and 137; in
the lowest, the middle range was from 99 to 117. Clearly, a student who
would succeed readily in one college might be far below his competitors in
another. Information about the competition to be expected is necessary in
guiding potential college students, both to insure that they will consider
colleges where they have a chance to be accepted and to increase their
chance of survival in the college they choose. To assist counselors, many
colleges now publish leaflets describing the ability distribution in their
entering classes, usually based on the Scholastic Aptitude Test of the College
Entrance Examination Board.


Meaning of Particular IQs


Some writers translate IQ levels into labels such as "normal," "near
genius," "feeble-minded," etc. This is misleading, because there is no
borderline at which genius, for example, suddenly appears. Some persons of IQ
110 make significant original contributions, and some of IQ 160 lead
undistinguished adult lives. Some adults of IQ 80 are incapable of adjustment
to the world, and some of IQ 60 support themselves and make an adequate
home.

A classification of mental deficiency provides a starting point for thinking
about the individual case. Persons with IQs from 40 to 59, for example, may
be labeled morons (Bernreuter and Carr, 1938), but while these categories
are convenient, it is wrong to think of them as pigeonholes. A quantitative
standard might seem to be the most just procedure for determining admission
to an institution for the mentally deficient, but this policy, tried in
the earlier days of testing, led to some ludicrous results. The distinguished
Czech statesman Jan Masaryk, during a childhood stay in America, was confined
briefly in an institution which had such a policy, no doubt because having
to use a strange language pulled down his Binet IQ (Porteus, 1950).
Clinical disposition of a case is always to be based on a combination
of test data with evidence on the person's functioning in social and
practical situations.

The human meaning of the high IQ is shown in the research by Catherine
Cox Miles, who estimated the IQs of famous persons from their childhood
histories. "Voltaire wrote verses from his cradle; Coleridge at 3 could read a
chapter from the Bible. Mozart composed a minuet at 5; Goethe, at 8, produced
literary work of adult superiority" (Cox, 1926, p. 217). The minimum
IQs which could account for the recorded facts about these men were
estimated as: Voltaire, 180; Coleridge, 175; Mozart, 160; Goethe, 190. The true
IQs might conceivably have been higher, but full evidence was not available.

Terman and Oden (1947) followed children with high IQs into adulthood.
Considered as a group, these young adults were found to be in every way
superior to average men and women of comparable age. Here are a few of
the facts about their careers: 90 percent entered college and 70 percent
graduated. At an average age of 40, the 800 men had published 67 books,
over 1400 scientific and professional articles, and over 200 short stories and
plays. They had more than 150 patents to their credit. As Terman says
(1954), "nearly all the statistics of this group are from 10 to 30 times as
large as would be expected for 800 men representative of the general
population." (See also Terman and Oden, 1959.)

The meaning of the IQ is best understood by one who has observed
children of known IQ in particular situations. A partial substitute for such
background may be gathered from the research literature, where various
writers have established the IQ requirements of particular tasks. The general
trend of these results is indicated in Table 20 and Figure 38. These
standards are not dependable guides for decisions in specific situations, but
they are nevertheless worth study.


TABLE 20. Expectancies at Various Levels of Mental Ability

130   Mean of persons receiving Ph.D.
120   Mean of college graduates
115   Mean of freshmen in typical four-year college
      Mean of children from white-collar and skilled-labor homes
110   Mean of high-school graduates
      Has 50-50 chance of graduating from college
105   About 50-50 chance of passing in academic high-school curriculum
100   Average for total population
 90   Mean of children from low-income city homes or rural homes
      Adult can perform jobs requiring some judgment (operate sewing
      machine, assemble parts)
 75   About 50-50 chance of reaching high school
      Adult can keep small store, perform in orchestra
 60   Adult can repair furniture, harvest vegetables, assist electrician
 50   Adult can do simple carpentry, domestic work
 40   Adult can mow lawns, do simple laundry

Sources: Beckham, 1930; Havighurst and Janke, 1944; Plant and Richardson,
1958; Wolfle, 1954; Guide to the Use of GATB, 1958; and others.


10. Why is the mean of college freshmen IQs higher than the level where there
is a 50-50 chance of succeeding in college? Is this desirable?

11. If the academic curriculum requires an IQ of 105, what does this imply
regarding educational planning for below-average youth?


FIG. 29. IQs obtained by 7-year-olds when tested on two forms of the
Stanford-Binet, Form L and Form M (Terman and Merrill, 1937, p. 45).
[Scatter diagram; both axes show IQ in five-point class intervals from 40-44
to 145-149.]


Error of Measurement


A coefficient of equivalence, showing how much the IQ is affected by
short-term errors of measurement, has been obtained by administering Forms L
and M a few days apart. The correlation is about .91 for unselected cases
(Terman and Merrill, 1937, p. 47). This marks the SB as one of the most
reliable of all tests. Even so, the average shift of IQ from one measurement
to another is substantial: 5.9 for . . . for IQ 100, 2.5 for IQs below 70.
. . . may be . . . 14 points from . . . , although such cases are infrequent.
It is easier to visualize the error of measurement by examining the scatter
diagram for 7-year-olds in Figure 29. We see that the test is more precise
for low IQs; changes below IQ 80 are . . . Notice also the occasional . . .
despite the . . . agreement between the two measures. For children with IQ
95-99 on Form L, Form M estimates range from about 87 to about 119.


12. What range of Form M IQs is found among children earning 130-134 on
Form L?

13. There are seventeen cases having Form L IQs of 125 and above. Their median
IQ on Form L is about 134. How many of them earned a higher score on
Form M? How many shifted to a lower class-interval?

14. Would the interpretation for any child be changed if his Form M IQ were
used instead of his Form L score?

15. What is the largest change of IQ in the chart?


Stability 


The stability of mental performance has direct practical importance; we cannot make long-range educational and vocational plans if ability changes greatly. Evidence on stability is also of great theoretical importance since it throws light on the nature of intelligent performance, and on the extent to which performance is predetermined by heredity and by events early in life.

Scores on the lower levels of the SB are much poorer predictors of adult IQ than are scores during the school years. One reason for inconsistency between early and later tests is that the nature of the test items changes and therefore different abilities are measured. Environmental influences in the early years may also develop abilities not shown in early test performance, or may retard those which did show. Bayley (1949) retested children repeatedly from age 1 month to age 18 years. Although her results are based in part on special tests for infants and young children which we have not yet described (see pp. 208 ff.), the findings apply to any present mental test.

Table 21 gives correlations between earlier and later measures. In these

TABLE 21. Correlation of Mental Test with Test at a Later Age

Approximate Age                          Years Elapsed Between First and Second Test
at First Test    Name of First Test      1    3    6    12
3 months California First-Year -10(CFY) .05(CP) —-13 0 
1 year California First-Year .47(CP) 23 43 ^42 
2 years California Preschool — 74(CP) 55 50 33 
3 years California Preschool 64 — 55 70 
4 years Stanford-Binet — Z1 73 saw) 
é-years Stanford-Binet 86 84 81 “gow! 
7 years Stanford-Binet 88 .87 73 -+ 
9 years Stanford-Binet .88 .82 .87 — 
11 years Stanford-Binet .93 .93 92 


Source: Bayley, 1949. Some entries have been estimated from closely related data given
in Bayley's report. Initials indicate second test; W stands for Wechsler-Bellevue. Where no
initial is given, Stanford-Binet is the second test.

results it is clearly seen that the later a test is given, the more stable the IQ.
Tests before age 2 are unstable even over short periods. Scores show a
marked increase in long-range predictive power near age 6.
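The entries in a table of this kind are product-moment correlations between a first and a later testing of the same children. As a minimal sketch of how one such coefficient is obtained, the code below computes Pearson's r for a handful of paired scores. The scores and the function name are hypothetical and are not Bayley's data.

```python
import math


def pearson_r(first, second):
    """Product-moment correlation between paired first and later scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long lists of at least two scores")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical IQs of six children tested at one age and again some years later.
first_test = [85, 92, 100, 104, 111, 126]
later_test = [90, 88, 103, 99, 118, 121]

print(f"test-retest r = {pearson_r(first_test, later_test):.2f}")
```

A coefficient near 1 means the children keep nearly the same rank order from one testing to the next; a coefficient near zero means the early score tells little about later standing.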
Figure 30 charts the change in scores found upon retesting of behavior-problem children with the 1916 revision. The average time between testings was 15 months, and the age at first test was generally between 7 and 14. A. W. Brown (1930) comments on these data as follows:

Although the correlation . . . is high, a large number of cases make considerable change, and from the clinical point of view these are often the important cases. One hundred eight cases or 15.2 percent change eleven points or more. To say that the average change is about five points does not help a great deal, because in dealing with clinical cases one can never be sure that the particular case under observation may not be one that will show a large amount of change. It would seem advisable therefore to secure at least two ratings wherever an intelligence rating is especially important in disposing of the case or in making recommendations.

FIG. 30. Changes in IQ when 1916 Stanford-Binet is repeated after an average interval of fifteen months (A. W. Brown, 1930). [Axis label: Change of I.Q. from First to Second Test.]

Another study of similar children over an even longer time (R. Brown, 1933) found that 3 percent changed more than 30 IQ points and 10 percent changed 21 to 30 points. It is unsound practice to rely on tests given several years previously. Extreme reversals, from IQ 70 to IQ 120, are rare, but some highly important shifts are found in most large groups.

The nature of IQ shifts is seen most clearly in the records of individual children who have been tested repeatedly. No "typical" pattern can be shown, for the changes take many different forms. The three patterns in Figure 31, selected to show some of the possible trends, are by no means exceptional. (These records are plotted in terms of standard scores, with the mean for children in this study taken as zero.) Case 783 is a boy whose test performance did remain stable even though he had a poor health history, an insecure and underprivileged home background, poor grades, and emotional symptoms such as stammering and enuresis. "There never was a time in his


FIG. 31. Records made by three children on successive mental tests (Honzik
et al., 1948). [Curves for Cases 567, 783, and 946; standard score plotted against age.]


history when he was not confronted with extreme frustrations." The IQ nevertheless held to the same satisfactory level. Case 946 has had IQs as low as [...] and as high as 149. Her parents were immigrants, and unhappily married; their conflict led to a divorce when the girl was 7. At 9, with her mother remarried, the girl was insecure at home and excessively modest. Her recovery perhaps reflects better adjustment to her family. The third case (567) shows consistent improvement. This girl's early years were marked by grave illnesses in the family, and the girl herself was sickly and shy. After age 10, her social life expanded and she developed rewarding interests in [...] and sports. This blossoming is paralleled in the test scores. Little is yet known about the causes of spurts of this kind.
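The records discussed here are expressed as standard scores, with the group mean taken as zero. A minimal sketch of that conversion follows. The group scores, the child's score, and the use of the population standard deviation are illustrative assumptions, not details taken from the Honzik study.

```python
import statistics


def standard_score(score, group_scores):
    """Express a score as its distance from the group mean in SD units."""
    mean = statistics.fmean(group_scores)
    sd = statistics.pstdev(group_scores)  # population SD of the reference group
    if sd == 0:
        raise ValueError("group scores must vary")
    return (score - mean) / sd


# Hypothetical scores of the reference group at a single age.
group_at_one_age = [88, 95, 100, 104, 113]
child_score = 113

z = standard_score(child_score, group_at_one_age)
print(f"standard score = {z:+.2f}")
```

Plotting such values at each age shows whether a child rises or falls relative to the group, whatever the raw score scale of the test used at that age.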
Scores of emotionally disturbed or uncooperative children are especially unstable. If maladjustment is continuous, the child's test score and his general performance may be constant, at an impaired level. But if the causes of emotional disturbance are remedied, drastic changes in IQ occur. Long-range planning on the basis of the IQ is justified so long as two precautions are observed: Interpretation must consider the elements in the child's background which would tend to raise or lower scores, and all judgments must be made tentatively, leaving the way open for a change of plans when change in development appears. The case of Danny (Lowell, 1941) should make clear the hazards that await the psychologist who treats every IQ as immutable.

Danny was born January 15, 1929. He entered kindergarten at the age of 
5 years and was such a misfit that after a few weeks he was given a Binet 
test. The following are records of the four tests given before the end of 
Grade 6, with the date of test, chronological age, mental age, and IQ. 


2-2-34     Age 5-0      MA 4-2     IQ 82
5-9-35     Age 6-4      MA 6-2     IQ 98
6-8-37     Age 8-5      MA 9-4     IQ 111
12-3-40    Age 11-11    MA 15-9    IQ 132
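The IQs in this record can be checked against the mental and chronological ages, assuming the classical ratio definition IQ = 100 x MA / CA (the passage itself does not restate the formula). The sketch below converts the year-month entries to months and recomputes each quotient. Because the printed ages are rounded to whole months, the results may differ from the printed IQs by about a point.

```python
def to_months(years: int, months: int) -> int:
    """Convert an age written as years-months into total months."""
    return 12 * years + months


def ratio_iq(mental_age_months: int, chronological_age_months: int) -> float:
    """Ratio IQ under the assumed classical definition, 100 * MA / CA."""
    return 100.0 * mental_age_months / chronological_age_months


# (date of test, chronological age, mental age, printed IQ) from Danny's record.
records = [
    ("2-2-34", (5, 0), (4, 2), 82),
    ("5-9-35", (6, 4), (6, 2), 98),
    ("6-8-37", (8, 5), (9, 4), 111),
    ("12-3-40", (11, 11), (15, 9), 132),
]

for date, ca, ma, printed in records:
    computed = ratio_iq(to_months(*ma), to_months(*ca))
    print(f"{date}: computed {computed:.1f}, printed {printed}")
```

The recomputed values rise across the four testings just as the printed ones do, which is the point of the case: the quotient itself changed by some 50 points.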

The first test showed such mental immaturity that Danny was excluded from kindergarten for a year. The next year he moved into another school district. This time his Binet score seemed normal; he was placed in the first grade in September in spite of his lack of social adjustment. The teachers complained that Danny seemed to live in a world of his own, was noticeably poor in motor coordination, and had a worried look on his face most of the time. The mother was called in, and only then was light thrown on his peculiarities.

The mother explained that while Danny was still a baby his father had developed encephalitis. In order for the mother to work, they lived in the grandparents' home so Danny could be cared for. Danny's grandfather was a high-strung, nervous old gentleman who was much annoyed by the child's noise and at times expostulated so violently that Danny became petrified with fear. The grandfather's chief aim was to keep [...] peaceful at any cost. When Danny was excluded from kindergarten the mother took him from the grandparents' home.

The next [...] of educational, social, and emotional growth for the starved child. He amazed his teachers with his achievement. He became an inveterate reader and could solve arithmetic problems far beyond his grade level. He was under a doctor's care much of the time and was also treated by a psychiatrist because of his marked fears. He made friends with boys in spite of physical inferiority.
VALIDITY OF THE STANFORD-BINET
Predictive Validity

The Binet test is generally used for prediction. It is employed in estimating the brightness of a child being considered for adoption, because prospective foster parents wish to be sure that the child has a good chance of equaling their own academic and business achievement. Another frequent application is in deciding how serious a case of mental deficiency or retardation is. Here again, both school performance and adjustment to the demands of normal living need to be predicted.

The stability of the IQ itself gives information on predictive validity. The adopting parent wants a child whose later IQ will be comparable to that of others in the family. Insofar as Binet performance at age 15 is accepted as a fair sample of intellectual performance, that test itself serves as a criterion for tests given in early years.

Interest in Binet performance, however, rests ultimately on its relevance to external criteria. You will recall that the Cattell-Wissler tests, for example, dropped from sight just because their predictive validity was disappointing. For the SB, there is rather little evidence to be cited in the form of up-to-date formal validation. For other tests, we will most often cite specific validity coefficients which permit us to compare quantitatively the efficiency of various devices used for making the same decision. The Binet test has been the patriarch of the tribe, standing without a rival until the Wechsler test recently became available. Many validity coefficients were calculated in the early days of the test, and these results were encouraging. Recent predictive studies have relied almost entirely on group tests, which were derived from the Binet scales. Since the individual test is now reserved almost entirely for individual study of perplexing cases, data are not at hand for computing its predictive validity for representative groups.

We may not entertain serious questions about the relevance of the SB score to practical prediction. Studies of high-school dropouts, of job potentialities of the mentally retarded, and the like show that the test tells a great deal about the person's expected success. Terman's follow-up of gifted children is a particularly good long-range predictive validation against criteria of school performance, attainment in adult professional careers, financial success, marital success, and adult mental health.

When validity coefficients are calculated, the results are always much the same. Here, for example, are the correlations in one high school between SB IQ in Grade 9 and achievement tests one year later (E. A. Bond, 1940, p. 29):
With reading comprehension    .78
With reading speed            [...]
With English usage            .59
With history                  .59
With biology                  .54
With geometry                 .48
All studies show discrepancies between Binet performance and attainment, even though the predictions are right in the majority of cases. Some of Terman's bright boys, though not very many, failed in college, or served a prison term, or had unhappy marriages and careers. Intelligence is only one facet of individuality to be considered in a practical decision about a child or adult. One can neither predict behavior of a person knowing only his IQ nor make a sound prediction without using a good estimate of his mental ability.
16. The correlation of SB IQs with grades of medical-school seniors was found to
be only .15 in one study. The average IQ of these men was 131 (Mitchell,
1943). Explain why the correlation was so small.


Construct Validity: What the Test Measures

The most important questions to be asked about the test are: What variables affect performance? What does the construct "general mental ability" mean? Since most subsequent general mental tests have been made to have high correlations with the SB, statements about the meaning of general ability apply equally to these tests.

Looking at Table 16, we see that the test items do fit Binet's definition of intelligence, in that they call for ability to maintain a definite set, adaptation, and self-criticism. That the items all depend on some common element which we can call general ability is indicated by the fact that each item correlates with the total test. But a thoroughgoing analysis must do more than accept items because they include an element we wish to measure. An equally important question is: What elements affect the score that are not considered in the definition? Logical analysis plus experimental studies have led to several important conclusions.

• The test measures present ability, not inborn capacity. Although it seems obvious that no test can measure anything but behavior here-and-now, there has been much confusion during the past forty years because to many people "intelligence" means inherited ability. While there is necessarily an inborn potentiality, the test measures only present ability, which is affected both by innate factors and by experience. Binet himself never considered that his tests measured innate capacity alone. If a user wishes to infer that a difference between SB scores of two children represents an innate difference, he must assume that the two children have had much the same experience during their lives. If a child has had the same opportunity as normal children to acquire skills, information, concepts, and work attitudes called for by the tests, his failure to come up to normal performance can reasonably be interpreted as showing that he failed to profit from his opportunities. The ability level can be changed by radical changes in early environment (S. A. Kirk, 1958), although we have found no general techniques for "mental orthopedics" (Binet's phrase) which will accelerate significantly the mental development of normal children from normal homes.

We can list endless variations in experience that would make it easier for one child than another to perform the Binet tasks, even if the two have equal native ability. Freddie is only 5, but his father has played number games with him so that he can count and add very well. Harold's mother did not like having her walls marked, so she refused to let Harold use pencils or crayons except under her supervision; Harold, at 7, doesn't seem to like drawing and is clumsy at it. Sarah lived in a remote rural area, where she never saw trains or telephones. Peter's parents are immigrants; although both parents can speak English, they find it difficult and use their native language at home. Frances has a set of books which include interesting puzzles, pictures containing absurdities, and pictures to compare for similarities and differences. Such variations in experience as these are common and may be counted on to modify both test performance and school performance.

17. List for each of these children some of the tests in Form L which they would
find easier or harder than children with “normal” experiences.
• Stanford-Binet scores are strongly weighted with verbal abilities. The great majority of test items call for facility in using and understanding words. If the child does poorly at these tasks, he probably will do poorly in other verbal activities. He may do badly on the test because of poor [...], but this will also cause him to do badly in school in the future. Thus the test is an excellent measure of scholastic aptitude, i.e., of readiness to do the sort of tasks required in school. Since Binet originally sought tasks which would distinguish pupils judged superior in school performance from those judged inferior, it is not surprising that the final test measures an ability important in schoolwork. If one were to examine intelligent acts outside of school, verbal facility might be found less important. The test is not a measure of all types of mental ability; critics note that it underemphasizes insight, foresight, originality, organization of ideas, and so on. A high score on the test should not be interpreted as guaranteeing the qualities which the test does not measure.


Among the pupils for whom this verbal loading produces an unfair picture of overall intellectual performance are bilingual children, children in whose homes English is little used, children with hearing deficiencies, and [...]. The examiner can often identify such cases by their spread of successes and failures, with success on nonlanguage items at levels much beyond their first failure on verbal concepts. Table 22 compares children who speak only English with bilinguals who speak a second language at home. Both groups were tested on the SB and the Atkins Object-Fitting Test, a performance test for preschool children which does not demand facility in English. It is evident that the bilingual group, superior on the nonverbal test, would be judged inferior on the SB.

TABLE 22. Mean IQ of Monolingual and Bilingual Children
on the Binet Test and a Performance Test

                         Mean IQ for     Mean IQ for
                         Monolinguals    Bilinguals
Test                     (N = 106)       (N = 106)      Difference(a)
Stanford-Binet           98.7            90.9            7.8
Atkins Object-Fitting    89.0            97.5           -8.5

(a) Significance tests show that neither difference could be due to chance.
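The Difference column of Table 22 is simply the monolingual mean minus the bilingual mean for each test. A brief sketch, using only the four means printed in the table (the variable names are illustrative), makes the reversal of sign explicit:

```python
# Mean IQs from Table 22: (monolinguals, bilinguals), N = 106 in each group.
mean_iq = {
    "Stanford-Binet": (98.7, 90.9),
    "Atkins Object-Fitting": (89.0, 97.5),
}

# Difference = monolingual mean minus bilingual mean, rounded to one decimal.
differences = {
    test: round(mono - bilingual, 1)
    for test, (mono, bilingual) in mean_iq.items()
}

for test, diff in differences.items():
    favored = "monolinguals" if diff > 0 else "bilinguals"
    print(f"{test}: difference {diff:+.1f} (favors {favored})")
```

The two differences are of nearly equal size and opposite sign, so which group appears brighter depends on whether the test demands facility in English.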

18. Which items in Table 16 depend upon previous school learning?

19. Which items would offer an advantage to a child from an upper-class home?
To a child from an impoverished working-class home?

20. Suppose one is faced with Binet's original problem, of deciding whether a pupil
failing in school could profit from the regular curriculum. If the pupil is
bilingual, would the SB or a performance test serve better?

• The Stanford-Binet score measures somewhat different mental abilities at different ages. This shift of emphasis is apparent in Table 16. Early tests call for judgment, discrimination, and [...]. Verbal tests and reasoning play a much greater part in later years. If all the tests measured general intelligence equally well, this would be no problem. But the simple mental development of early childhood predicts only roughly the later emergence of [...] and higher mental abilities. While the lower levels of the scale are excellent for identifying children with abnormally slow development, they do not predict accurately the subject's later standing.

The clearest study on this comes from Maurer's work with the Minnesota Preschool tests, which are similar to the Binet [...]. She followed children from [...] to late adolescence and retested them to determine what preschool items best predicted intellect at maturity. As a measure of maturity she used a group test heavily weighted with verbal materials; this test is also highly correlated with both Binet scores and school success. She found that many items which correlated well with the rest of the preschool scale were poor predictors of later development. Among the poor predictors were pointing out parts of the body, obeying simple commands, comprehension, and paper folding (Maurer, 1946).

Measures of an ability made at the time when it is first being developed are generally poor predictors. How early a child learns to count, for example, depends upon accidental factors as well as upon his brightness. Many pupils who start later will overtake the one with the best early performance. Stable measures therefore must be based on abilities that are already well formed. Thus John E. Anderson (1944) argues that vocabulary is a good test for older children just because it is based on a long period of environmental stimulation. Maurer (p. 86) confirms this in her search for tests of young children which will predict later IQ.

[Good] tests for younger children make only minimal demands on language. They require perception of form and spatial relationships and the ability to reproduce them. They do not demand complex motor coordinations. They require controlled attention and ability to persist to a goal. Many of them are comparatively independent of training. Tests for older children [4-5 years] involve use of language in relationships which are not often practiced and constitute problem-solving situations involving the use of well-developed tools.


• The test requires experiences common to the U.S. urban culture and is of dubious value for comparing cultural groups. The Zuñi Indians, for example, have a cooperative society most unlike the competitive attitudes we tend to encourage. Zuñi children have races. But a child who wins several races is censured for having made others lose face. He must learn to win some races to show he is capable, and then to hold back and give others an opportunity to win. In arithmetic, white teachers sent Zuñi children to the blackboard for arithmetic drills, with instructions to do a problem and turn their backs to the board when finished. Instead, the pupils faced the board until the slowest had finished; then all turned. This was to them simple courtesy; following the teacher's direction would have been exhibitionism. It is easy to see why the typical American speed test gives misleading results among the Zuñi. A Binet test fares no better; the first subject may miss some items deliberately, because he fears the next child will be unable to answer. All intelligence tests face the same problem; they are meaningful only for comparing persons with similar experience. Anglo-Americans would perhaps do badly on a test developed by a Zuñi psychologist, using questions which differentiated between good and poor members of Zuñi [...].

• The Binet test does not give a reliable measure of separate aspects of mentality. Scores are influenced by specific as well as general abilities. It would be helpful in diagnosis if we could divide the Binet test into segments and obtain separate estimates of verbal ability, information, and so on. Any particular specialized ability is used in only a few items; therefore combining those items would not give a reliable measure of the ability. Binet and Terman deliberately sought a great variety of tests, so that no one subdivision of intelligence would have great weight in the final score. This feature makes the SB unsuitable for measuring the aspects of ability separately.

21. What sort of items have the greatest correlation with total test score at levels
II-6 and IV (see Table 16)? What does this suggest regarding the meaning of
"general mental ability" in preschool applications of the Stanford-Binet?

22. What items have the greatest correlation with total score at the upper end of
the scale? What is the meaning of “general ability” at that level?


• The Binet score is influenced by the subject's personality and emotional habits. Binet's description of intelligence includes persistence, flexibility of mental approach, and criticalness, all of which are aspects of personality. Among the emotional habits which have an obvious effect on scores are shyness with strange adults, lack of self-confidence, and dislike for "schoolish" tasks. A self-critical person may say "I don't know" because he is dissatisfied with the best answer he can formulate; a person less sensitive to niceties may give an answer which is passable. A pedantic urge to accuracy may make it relatively easy to do memory tasks. Fear may cause a child to "freeze up" so that he cannot find a new mode of attack when his first one is blocked. No matter how careful a tester is, there is some danger that a child may fail an item that he could have passed if ability alone were required. One should therefore always bear in mind that the final test score shows how well the child functioned at this time; this score may be markedly affected by emotional complications.

Hutt (1947) points out that the child encounters considerable frustration from a succession of failures, and that this stress comes at different points for different children. He proposes to "standardize" this stress by alternating easy and hard items. In an experimental trial with comparable groups, he found that very well-adjusted children earned the same IQs on his "adaptive" procedure as on the usual test. The badly adjusted children, however, averaged 4.5 points higher in IQ with the adaptive method.

Any departure from standard administrative practice changes the meaning of scores. It can be readily seen that Hutt's method will yield a higher average IQ than the Terman-Merrill procedure. Many testers who are not willing to take the radical step Hutt proposes, which places a [...] on the tester's judgment, are nonetheless distressed by the fact that the child encounters more and more failures, ending the test with no less than six failures in a row. This damages the clinical relationship and influences the subsequent tests. To avoid this outcome, and also to simplify administration, they favor "serial administration," in which all memory-for-digits items, for example, are presented together. Terman and Merrill arrange items by difficulty rather than content, and their directions insist that this order be followed. Nonetheless, a good many testers have changed over to the more convenient serial plan, pointing to the evidence of Frandsen and others (1950) that the mean IQ is the same for the two techniques.

23. In indicating the importance of objective mental testing, Terman says, "I
believe it is possible for the psychologist to submit, after a forty-minute
diagnostication, a more reliable and more enlightening estimate of the child's
intelligence than most teachers can offer after a year of daily contact in the
classroom." In which of the following features does the advantage of the test
over the teacher's report lie?

a. Freedom from personal prejudice.

b. Considering more aspects of mental ability.

c. Considering a basically different trait.

d. Observing capacity rather than level of actual performance.

e. Sampling behavior under a wide range of conditions.

f. Permitting an exact comparison of the child with a standard of normality.

24. How could you decide whether Hutt's "adaptive" procedure is more valid than
the standard method?

Diagnostic Interpretation 


Children with the same MA are of course far from alike in mental development, as is shown by the fact that they pass quite different tests. The Stanford-Binet, as a standardized but complex situation, brings to light far more individual differences than the single score represents. Experienced testers always study such differences, and many have tried to develop supplementary systems of scoring to report this information. In particular, many have hoped that the scatter of performance would have diagnostic value. The scatter is the range from the child's earliest failure to his highest pass; it suggests whether all aspects of ability have developed evenly. After many studies of scatter, investigators now agree that it has no value as a score. All other attempts to obtain diagnostic scores from the Stanford-Binet have similarly failed.

The SB test will not yield meaningful diagnostic scores because it was designed to prevent any factor save "general ability" from influencing scores to a measurable degree. We cannot trace accurately the child's [...] in simple recall, for example, because digit-span and other recall tests are not uniformly spaced at all levels of difficulty. Even within one year scale we cannot discuss the child's strengths and weaknesses with confidence, because tests grouped together do not have exactly the same difficulty.

Nevertheless, the psychologist ought to study the detailed pattern of test performance. If a child has an unusual handicap or facility in verbal [...], the examiner has an excellent opportunity to note it. Deficiencies in information, arithmetic skill, and reasoning may also be noted. A distinction should be made between the child successful because of coaching, who does well on such teachable items as counting to 18 or saying the days of the week in order, and the more genuinely intelligent child of the same age who can make up a coherent story about a picture and tell what day of the week comes before Tuesday. These indications, even if brought to light only in one or two subtests, provide profitable leads for further study. They should be confirmed by reliable tests of the separate abilities.

The SB affords an excellent opportunity to see how the child works. An impulsive child will be observed to use trial and error in an attempt to "force" a solution instead of reasoning. An inhibited child may refuse to take a chance on items where induction or imagination is called for and he cannot be positive that his answer is right. Others give answers even to questions about which they are ignorant.

The outcome of careful clinical study is illustrated by the tester's comments regarding John Sanders, a normal adolescent (age 12-8; IQ 109) (H. E. Jones et al., 1943, pp. 91-92):

John showed a lively intellectual curiosity and was interested in a variety of [...], but within each of these interests his [...] seemed to be rigid and single-tracked. This lack of flexibility made it difficult for him to adapt to requirements when on unfamiliar ground. Upon encountering difficulties, he frequently demanded a pencil, because he could not "see" the words or numbers; I never tested a more eye-minded person.

John's principal difficulties were on tests requiring precise operations, as in the use of numbers. With such tests he became insecure and often seemed confused, with slips of memory and errors in simple calculations. He asked to have instructions repeated, was dependent on the examiner, and easily discouraged. Although cooperative and anxious to do well, it was extremely hard for him to master a task such as [...] in which he was required to be exact by fixed standards. [...] it is not surprising that in his school work he has [...] in mastering the mechanics [...] he has had such [...] difficulty in learning. [...] upon himself; in tasks [...] upon him from without. [...] the discrepancies between John [...] in certain fields.


Attempts have been made to identify emotionally disturbed persons by [...] of their subtest scores, and abnormal groups do show some departure from normal averages. Myers and Gifford (1943), for example, find schizophrenics superior in vocabulary, abstract words, and dissected sentences, compared with normals of the same mental age, but much poorer on [...] chains, picture absurdities, and memory for stories. Knowledge of such averages is useful background for the tester, but many normals show patterns which are just like those of typical patients. The SB pattern cannot be used to make a definite diagnosis.

Responses sometimes reveal disturbances of thinking. Feifel (1949) has 
found that mental patients and normals respond to vocabulary items in dif- 
ferent ways. Normals tend to use synonyms, while abnormals give defini- 
tions by use and description, explanation, and illustration. Asked what an 
envelope is, normals said “a container,” “a receptacle for paper,” “something 
to put a letter in,” etc. Typical patient responses were “a piece of paper you
fold,” “you write letters,” “it’s sticky on top so you can paste it down,” and “to mail.”

Responses may disclose values and attitudes. Strauss (1941) asked mentally defective delinquents, “What ought you to do before undertaking something very important?” (Year X). Their answers included: “[...] touch anything that doesn’t belong to you,” or “Run away from a guy [...] going to take it. Go tell him nothing of the people that owns them.” Defining pity, one of them answered, “Don’t take pity on somebody, shoot [...] and kill them.”

Essentially the SB is a standardized clinical observation. The fact that it yields an IQ should not blind the tester to his obligation to report everything he can observe. There is no adequate rationale for making and interpreting these observations, and the findings are necessarily tentative. But to [...] them because they are subjective is no more sound than if the psychologist were to refuse to have a conversation with the child because it would not lead to a statistically manageable and reliable score. The Binet tester with adequate experience has a great advantage over the clinical interviewer because he can observe the child in a standardized situation and can compare what he does with the behavior of other children. The fact that the child does not realize that the test situation reveals his emotions and habits of work is a further asset.


25. What sort of report should be placed in the school files for a child who has been given the SB?


General Evaluation 


The Stanford-Binet scale is an instrument efficiently designed for a particular function, namely, providing a single score describing the child's present level of general intellectual ability. It is interesting to the child, precise, and well standardized. The large amount of research on the scale gives a basis for interpreting results which no newer test can offer. The 1960 revision makes an important improvement in discarding the conceptually and statistically unsatisfactory ratio IQ; on the new scale an IQ of a given size has the same interpretation at all childhood ages. The revision retains the items from Forms L and M which made the greatest contribution to the total score. The new scale places greater emphasis on word knowledge than the older forms. Items whose content is peripheral to scholastic aptitude [...] to provide a more concentrated measure. The gain in ac[...] some loss [...] items and in opportunity for observation. [...] Form M, and the Wechsler scale may be used for cases where a second measure is required to confirm a doubtful Binet score.

[...] finds the Wechsler scale for children a strong competitor. The chief differences are in organization, in the greater precision of the [...] mental ages, and in the greater variety of tasks in the Wechsler. The difference in content of the two scales is magnified by the L-M revision, which narrows and focuses the SB. Deliberate concentration on [...] and educational abilities is an advantage for some purposes, a disadvantage for others, as is made apparent in E. L. Thorndike's comment (1921):

If the boy has had ordinary American opportunities, this score [in standardized tests of the Binet or of the group test type] will prophesy rather accurately how well he will respond to intellectual demands in cases of “book-learning” at the time and for some time thereafter, and very possibly for all his life. It will prophesy less accurately how well he will respond in thinking about a machine that he tends, crops that he grows, merchandise that he sells, and other concrete realities that he encounters in the laboratory, field, shop, and office. It may prophesy still less accurately how well he will succeed in thinking about people and their passions and in responding to these.

Such objections as this have led clinicians to combine the SB with performance scales. The SB does not measure all aspects of mental ability, nor does it measure inborn capacity. It is properly interpreted as a measure of present status in one important type of mental development.

26. The Stanford-Binet has been criticized because it contains numerous items relating to death and other morbid subjects. What has this to do with the value of an intelligence test, so long as brighter pupils pass these items?

27. Terman, in [...]al Binet scale, discarded items which showed a consistent difference in favor of either sex. His argument was that a fair comparison could not be made if items favored one sex or the other. Did the elimination of such items raise or lower the validity of his scale?


PERFORMANCE SCALES

Individual tests owe their prominent place in testing of problem children, psychiatric patients, and the mentally retarded chiefly to their value as a situation for observing performance. Any individually administered test offers some opportunity to observe the nature of the subject's errors, habits of performance, and emotional reactions. Performance tests such as Block Design, however, permit especially revealing clinical observations. Moreover, they depend very little on language and schooling, which makes them suitable for evaluating young children, adults with limited schooling, and persons unfamiliar with the language of the tester.

The Stanford-Binet includes some performance tests: drawing, bead stringing, fitting blocks into holes, etc. These tests are relatively few in number, are concentrated at the easier levels, and in general do not require complex reasoning. Terman equated intelligence to "the power of abstract thought," and therefore most of his items involved verbal or numerical concepts. While verbal items do have considerable predictive power, especially for educational criteria, clinical testers need more elaborate performance scales than the Binet offers.

Performance scales give somewhat different information from that yielded by the Stanford-Binet. Figure 32 shows a comparison of ten children on the Stanford-Binet and on a composite of three performance tests. There are several shifts in rank order, the most striking change being Margaret's. She is highly superior in schoolwork, but she has a heavy build [...] and is socially awkward with adults. Her difficulty in performance tests may therefore reflect personality problems that limit her effectiveness in many life situations.

FIG. 32. Rank of ten superior 7-year-olds on the Binet test and on a battery of performance tests (Biber et al., 1952).

Rank  Binet IQ  Rank order, Binet IQ  Rank order, average performance test score
 1    [...]     Margaret              Douglas
 2    143       Douglas               Amy
 3    143       Amy                   Christopher
 4    134       Carol                 Virginia
 5    132       Christopher           Dick
 6    129       Virginia              Walter
 7    120       Allison               Carol
 8    110       Mark                  Allison
 9    109       Dick                  Mark
10    104       Walter                Margaret
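
The size of the rank shifts in Figure 32 can be summarized by a rank-order correlation. The sketch below is an illustration only, not a computation reported by Biber et al. It assumes that the two orders are read from the figure from top to bottom, it ignores the tie between the two IQs of 143, and the helper name `spearman_rho` is hypothetical.

```python
# Rank orders as read from Figure 32 (best first). Treating the two IQs of 143
# as untied is a simplifying assumption of this sketch.
binet_order = ["Margaret", "Douglas", "Amy", "Carol", "Christopher",
               "Virginia", "Allison", "Mark", "Dick", "Walter"]
performance_order = ["Douglas", "Amy", "Christopher", "Virginia", "Dick",
                     "Walter", "Carol", "Allison", "Mark", "Margaret"]


def spearman_rho(order_a, order_b):
    """Spearman rank correlation for two untied rankings of the same people."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both rankings must contain the same names")
    rank_a = {name: i + 1 for i, name in enumerate(order_a)}
    rank_b = {name: i + 1 for i, name in enumerate(order_b)}
    n = len(order_a)
    sum_d2 = sum((rank_a[name] - rank_b[name]) ** 2 for name in order_a)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rho_all = spearman_rho(binet_order, performance_order)

# The same comparison with Margaret left out, to show how much one child matters.
rho_without_margaret = spearman_rho(
    [name for name in binet_order if name != "Margaret"],
    [name for name in performance_order if name != "Margaret"],
)
print(round(rho_all, 2), round(rho_without_margaret, 2))
```

On these assumed orders the coefficient is about .19 for all ten children and about .63 when Margaret is left out, which shows how heavily one large shift weighs in so small a group.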

A sample of the information the performance test yields to a skilled observer is indicated by the following record. Mark, 8 years old, [...] Binet MA of 8-8. On some performance tests, however, he reached an [...] of [...] (Biber et al., 1952).

The striking feature of Mark's examination was his extreme lack of confidence [...] desire to do what was expected of him. This was manifested by his [...] reference to the examiner. Throughout all of the tests, although he said little, it was evident that he was referring back to see whether the expression on [...] face indicated approval.

[...] the Healy Completion [fitting small blocks into holes to complete a picture] the examiner noticed that once when she gave him a friendly smile he was content to leave an inferior solution, as if he were guided much more by his wish to please than by his own good intelligence. Although she busied herself with papers and [...] to pay as little attention as is compatible with a test situation, it was impossible to prevent this. The directions in the Healy Completion to look the work over carefully and see if there are any changes to make seemed to imply criticism to him, and he removed a block which was correctly placed and substituted a [...]. His first responses were all good. In this test, he placed the first three accurately; then, apparently, he began feeling anxious or uncertain, and the last [...] he placed were blanks. It seemed that he was using the blanks as a way of avoiding committing himself to a mistake, and that he felt that he would rather do nothing than to get the wrong result. This test was the most plainly motivated by his desire for approval although there were indications of it throughout the other tests as well.

[...] Pintner-Paterson series, he seemed to be less conscious of the examiner, [...] because he felt more sure of himself in these tests. When he was uncertain, as in the Ship and Triangle Tests [formboards], he would look up shyly [...] worked. Several times he commented, "That's easy."

The first part of the [Porteus] maze series he enjoyed, working quickly, accurately, and with ease. After his first failure in Year VIII, he seemed much more uncertain and slow. After practically every one he said, "I'm not going to do any more [...]." With constant encouragement, he went on and completed Years X, XI, and XII, although he had four trials on Year XII. Toward the end of the series, there was little evidence of real effort on his part, but rather he seemed to be going through the motions because the examiner urged him on.

Probably no test results on Mark are completely accurate because other factors besides ability [...] his behavior. Difficulty did not stimulate him, as [...], but simply discouraged him and [...]. He was responsive to praise, but [...] questioning expression, as if he were trying to ferret out [...]. It was consistent with his total defensive attitude that he [...] very little information during the test and several times he responded very shortly to questions that the examiner put to him.


28. What light do these observations throw on the interpretation of Mark's Binet IQ?

29. List several of the characteristics a tester should attempt to observe in giving an individual performance test.

30. In many clinical examinations, only part of a battery of performance tests is used, to save time. On what basis would you decide which subtests to retain?


THE WECHSLER SCALES

The most important performance tests today are those included in Wechsler's intelligence scales. His effort at test development began at the Bellevue Hospital operated by the city of New York, where social derelicts of many sorts had to be tested. These persons might be feeble-minded, psychotic, or illiterate; estimation of the intellectual level of each case was important in determining his disposition. Wechsler prepared the Wechsler-Bellevue Scale Form I in 1939 to provide for such clinical evaluations. This scale was of great value in military hospitals during World War II and became one of the chief tools of the clinical psychologist after the war. A second form was published in 1946 but was never adequately standardized. The “Wechsler-Bellevue Scale” is now obsolete, having been replaced by better-constructed and better-standardized forms. Today’s Wechsler series consists of WAIS, the Wechsler Adult Intelligence Scale (1955) for ages 16 and above, and WISC, the Wechsler Intelligence Scale for Children (1949) for ages 5-15.

Strange ironies attend the history of test development. Binet set out to identify mental defectives, yet the most famous piece of research with his scale was concerned with children of superior endowment. Wechsler set out to prepare a new type of mental test for adults, because adults and children differ in their interests and approach to work. Yet today his technique is popular as a children's test. His secondary hope in developing the [...] was that patterns of subtest scores would provide a ready means of clinical diagnosis. The hope was not realized, and this type of analysis is no longer depended upon because empirical checks show that pattern analysis has little validity. Wechsler’s series is now of chief importance as a general individual test for all ages.


Test Materials and Procedure 


Wechsler collected a variety of items, many of them from previously published tests. He subscribed to Binet's idea of a general mental ability; [...] experience suggested that in the mental patient some types of [...] or reasoning are more disturbed than others. Wechsler gave preference to items which he had found useful in understanding the intellectual functioning of patients. Wechsler sought items which, while falling within what we identify as general mental ability, had sufficiently specific characteristics to silhouette different types of thinking or performance.

In contrast to the Terman-Binet plan of grouping items according to difficulty, mixing content randomly, Wechsler arranges them into subtests of various types. There are eleven scored subtests, grouped in a Verbal and a Performance scale. The Verbal scale includes tests of Information, Comprehension, Digit Span, Similarities, Arithmetic, and Vocabulary. The Performance scale includes Picture Arrangement, Picture Completion, Block Design, Object Assembly, and Digit Symbol tests.

We shall describe the Verbal scales only briefly and then turn to the Performance tests. Items are taken from the WAIS form (Wechsler, 1955).

Information includes such items as "What is the population of the United States?" "What does rubber come from?" and "How many weeks are there in a year?" Comprehension questions include "Why should we keep away from bad company?" and "What does this saying mean? 'Shallow brooks are noisy.'" The subject is expected to give a generalized, fairly direct answer. In Digit Span, the subject is asked to repeat digits forward and backward. The Similarities scale asks the subject to tell how the following are alike: orange and banana, air and water, poem and statue, etc. Arithmetic is a test of numerical reasoning ability using simple verbal problems, such as "How many oranges can you buy for 36 cents if one orange costs four cents?" The subject is required to do the items mentally and receives no credit on an item where he uses more than a reasonable time (e.g., thirty seconds for the question about oranges). Vocabulary requires the subject to define or explain such words as "fabric," "conceal," and "tirade."

The Block Design test was described in Chapter 3. Materials used in several other Performance scales are illustrated in Figure 33. Whereas Block Design requires analysis of a complex whole, breaking a pattern into elements, Object Assembly gives the parts and requires the person to discover how they go together. The four tasks are the profile, manikin, hand, and elephant. Time bonuses for rapid performance are allowed.

FIG. 33. Materials from Wechsler Performance tests: Object Assembly, Picture Completion, Digit Symbol. (Copyright ©, 1955, The Psychological Corporation. Reproduced by permission.)

2 Items quoted in this section copyright © 1955, The Psychological Corporation. Reproduced by permission.

Digit Symbol requires the person to fill in the proper code symbol under each number, doing as much as he can in a short time. Our illustration shows only five symbols; the actual test uses ten. The code remains in front of the subject as he works. Thus he can continually refer to the code, or he may carry it in his head. Learning the code is easy enough that for [...] average adults the score becomes a measure of writing speed rather than mental ability.

There are two picture tests. Picture Completion uses items presented on cards, each showing a picture from which something is missing. The subject tells what is lacking. In Picture Arrangement, a story is told in three or more cartoon panels which are presented in random order; the subject must put them together in the correct order. Here again, the subject must identify a complex whole from disorganized parts.

The WISC series is a downward extension using easier items than WAIS. The same subtests are used, but Digit Span is an optional test for children because it has a low correlation with overall performance. A Maze test is added as an optional performance test. Coding a simple message, [...] used by Terman, is substituted for the more difficult Digit Symbol.

The Wechsler scales are comparatively simple to administer, the full WAIS requiring about one hour. The directions are less complex than for the Binet, and keeping similar items together reduces the task of the examiner. The skill of the examiner may influence the score greatly. In [...] of the verbal tests, the examiner must make rather sensitive judgments about the correctness of an answer since it may be necessary to request the subject to elaborate his meaning. Answers that seem wrong may [...] when the subject explains himself. Subjectivity in scoring borderline answers is also a potential problem.


31. Does the Digit Symbol test call for the same mental processes when three digit-symbol pairs are used as when ten are used? Does Wechsler’s Digit Symbol test call for the same processes from bright and dull subjects?

32. Which Wechsler subtests have the following characteristics?
a. The score is affected by educational background.
b. The test demands experiences found in the urban American culture.
c. The test requires problem solving or reorganization of knowledge rather than mere recall.
d. The test measures very simple mental processes such as Cattell and Wissler investigated.

33. How do the Wechsler test items differ from the higher levels of the Stanford-Binet?

Meaning of Wechsler IQs

The raw scores on the subtests are converted into scaled scores, i.e., normalized standard scores with a mean of 10 and s.d. of 3. This conversion for WAIS is based on a reference group of adolescents and adults of each age, carefully chosen to match the census distribution on sex, geographical region, urban-rural residence, race, occupation, and education. A similar sample (restricted to white children) was used for the WISC.

Wechsler introduced standard-score IQs in his first edition, anticipating a practice which Terman and Merrill later accepted. He chose to fix the mean at 100 and the standard deviation at 15. The discrepancy between Wechsler standard scores and those developed with the Terman-Merrill s.d. of 16 is unfortunate, but should rarely be a source of serious confusion. Wechsler eliminated completely the mental-age conversion, which is a source of some misunderstanding in Binet interpretation.
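
The practical effect of the two standard deviations can be seen by carrying an IQ from one scale to the other through its standard score. The sketch below is a minimal illustration under the assumption that both scales have a mean of 100 and differ only in the s.d. quoted in the text (15 and 16); the function names are hypothetical, and the sketch says nothing about differences between the two norm groups.

```python
MEAN_IQ = 100.0
WECHSLER_SD = 15.0   # standard deviation fixed by Wechsler (from the text)
BINET_SD = 16.0      # Terman-Merrill standard deviation (from the text)


def to_standard_score(iq, sd, mean=MEAN_IQ):
    """Express an IQ as a distance from the mean in s.d. units."""
    return (iq - mean) / sd


def rescale_iq(iq, from_sd, to_sd, mean=MEAN_IQ):
    """Re-express an IQ on a scale with the same mean but a different s.d."""
    return mean + to_sd * to_standard_score(iq, from_sd, mean)


for iq in (70, 85, 100, 115, 130):
    on_binet_units = rescale_iq(iq, WECHSLER_SD, BINET_SD)
    print(iq, round(on_binet_units, 1))
```

An IQ of 130 in 15-point units corresponds to 132 in 16-point units, and an IQ of 115 to 116, so the difference in units amounts to a point or two except at the extremes.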

Wechsler criticized sharply the original Stanford-Binet assumption that mental ability remains constant during adulthood. Mean scores on almost any mental test rise during early adulthood and decline later. Wechsler therefore developed separate standard-score conversions for adult age groups. In the Stanford-Binet system, where adult norms have not been developed, a given raw score yields the same IQ at all adult ages. In Wechsler's conversion tables, a raw score of 129 yields an IQ of 115 at age 16, [...] at age 20, 114 at age 40, 121 at age 60, and 136 at age 80. Wechsler's other major innovation in scoring was to provide separate standard-score conversions for the Verbal and Performance scales.

Wechsler and SB IQs are not interchangeable. When Bayley (1949) gave the 1937 SB and Wechsler-Bellevue tests to the same group of adolescents, the mean SB IQ was 132 and the mean Wechsler IQ only 122. The Wechsler s.d.'s were also lower. This is confirmed by the fact that in Wechsler's standardization for WISC only half as many children had IQs 130 and over as in the Terman-Merrill standardization. Even clearer evidence (Table 23)


TABLE 23. WISC and SB Results for Representative Children in New York City

                           Mean    s.d.   Correlation with SB
Stanford-Binet Form L      [...]   [...]
WISC: Full scale           [...]   12.8   .82
      Verbal scale         103.4   13.6   .74
      Performance scale     98.3   15.0   .64

Source: J. Krugman et al., 1951.


comes from the New York City portion of the WISC standardization data, [...] 382 children drawn from eighteen schools were tested with both WISC and SB. The SB IQs ran substantially higher and their s.d. was greater (J. Krugman et al., 1951). Since both the WISC and SB scales were standardized on carefully selected samples, it is hard to decide which set of norms is wrong. The best we can do without much more evidence is to recognize that SB IQs average some 7 points higher than Wechsler IQs during childhood and early adulthood.

Wechsler's WAIS standardization data are consistent with his belief that mental ability reaches its peak in early adulthood. In Figure 34, we see that [...] accepted as a true picture of the course of intellectual growth and [...] that persons at the two ends of the chart belong to different generations and developed their ability under quite different social circumstances. [...] (1956) points out that Wechsler's older groups have less education than his younger samples, which may account for much of their poorer performance. Bayley (1955) combined evidence from three longitudinal studies in which the same persons were tested on occasions as much as thirty years apart, and concludes that the test scores continue to rise at least until the age of 50. If her curve based on limited samples (and depending on a vocabulary test for many of its points) shows the true pattern of growth in mental ability, Wechsler's norms will soon be outdated (see also Bayley, 1957; Bradway et al., 1958). Wechsler showed that, in 1950, adults born in 1910 (age 40) performed worse than adults born in 1920 (age 30). But if Bayley is correct, this latter group will continue to grow, and in 1960 their scores will average much better than Wechsler's 40-year-old sample tested in 1950. Bayley's data suggest that cultural changes are year by year raising the mental ability of the nation.

FIG. 34. Changes in mental-test score with age (20 to 60), based on cross-sectional samples for the WAIS (Wechsler, 1955). An arbitrary common scale has been used for plotting the two scores, since raw scores on one scale cannot be compared with raw scores of the other.

34. A maze test is available in WISC but not in WAIS. What data would one obtain to decide whether the maze test should be made a part of the WAIS scale?

35. The age curve for the WAIS, based on data gathered in the 1950's, has its peak at a later point than the curve based on the Wechsler-Bellevue standardization in the late thirties. What does this fact imply?

36. Would a vocabulary score be more or less likely than a performance score to improve between ages 20 and 40?


Adequacy as a Measure of General Ability

The Wechsler test, taken as a whole, measures about the same ability as the Stanford-Binet. The correlation of .82 between the two tests reported in Table 23 is fairly representative of the concurrent validity of the Wechsler tests.

The correlations show that the Verbal scale is much more closely related to the SB than is the Performance scale, and in some studies the Verbal scale gives a significantly higher correlation with the SB than does the Full scale. There is, then, a real psychological difference between the Stanford-Binet and the broader Wechsler. In any composite score, however, elements present in only part of the test have far less influence on the total than do elements running all through the test. Abilities to [...], to concentrate, to criticize and correct one's [...] words and pictures referring to familiar [...] both the Verbal and the Performance scales. These general [...] largely determine the total score on both the Wechsler [...] Binet tests; [...] abilities found only in arithmetic items or perf[...] influence, but not very much.

The reliability of the Full scale of the WAIS is reported by Wechsler [...] (Wechsler, 1958). This split-half coefficient based on persons of uniform a[...] is spectacularly high; it is reasonable to expect a somewhat lower [...] between tests at two sittings. For the WISC, split-half coefficients are a[...] above .90. There is as yet little evidence on the stability of Wechsler [...]; we can expect the total score to be as stable as that from the Stanford-Binet, with the Verbal score more stable than the Performance score (Bayley, 1957). The subtests will probably show different degrees of [...], and such evidence may have important theoretical and practical implications.
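
A split-half coefficient of the kind quoted above is ordinarily obtained by correlating scores on two halves of a test and stepping the result up to full length with the Spearman-Brown formula. The sketch below assumes an odd-even split and uses made-up item scores; the function names are hypothetical, and nothing here reproduces Wechsler's own computations.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation of two halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per person (odd-even split)."""
    odd_half = [sum(person[0::2]) for person in item_scores]
    even_half = [sum(person[1::2]) for person in item_scores]
    return spearman_brown(pearson_r(odd_half, even_half))


# Hypothetical item scores (1 = pass, 0 = fail) for six persons on eight items.
scores = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))
```

Because both halves are taken at one sitting, a coefficient computed this way leaves out day-to-day variation, which is one reason to expect a lower figure when the test is repeated at a second sitting.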



37. Clinical psychologists frequently distinguish between a patient's "[...] equipment" and his "functioning ability." Equipment is thought of, not as inborn capacity, but as the maximum intellectual power the person can summon up at this time. The equipment often does not function at its best, however, because of impulsiveness, inhibition due to anxiety, autistic [...], or other limitations. This point of view argues that people fall below their [...] potential to varying degrees.
a. In terms of these concepts, what does the Wechsler test reveal?
b. In terms of these concepts, what does the Binet test reveal?
c. Which of these concepts comes closest to “intelligence”?


The Verbal and Performance Scores 


The separate IQs for Verbal and Performance tests measure different abilities. This is shown by their correlations with the Stanford-Binet, [...] above, and by their correlation with each other. In various age groups, Verbal IQ and Performance IQ correlate only .77 to .81, though their split-half reliabilities are .93 or better.

Most performance tests, taken singly, are extremely unreliable, and even scales combining several tests may be undependable. The high reliability of Wechsler's Performance scale shows that he has done unusually fine test construction. Each performance item requires longer than a verbal item, and it is therefore difficult to obtain as good a sample of ability in limited time. Emotional blocking, carelessness, and undue haste often cause a person to fail a performance item which would otherwise be easy for him. Wechsler has overcome these adverse influences by writing clear directions, [...] a variety of tasks with several items of each type, and developing precise scoring standards. As a result, his Performance scale is probably the most dependable nonverbal measure ever developed.

Even though the two scales are highly reliable, the difference between them is less reliable (see p. 287). The Wechsler Verbal and Performance scores are quite accurate, and the difference between them has an estimated reliability of .74. This is high enough to justify drawing conclusions about a person whose Verbal and Performance IQs differ by 15 points or more. Small differences, however, cannot be taken seriously.
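
The reasoning about Verbal-Performance differences can be followed with the usual formula for the reliability of a difference between two scores that have equal standard deviations. The sketch below is an illustration under stated assumptions: the reliabilities of .96 and .93 and the intercorrelation of .80 are values chosen from within the ranges quoted in the text, not figures taken from Wechsler's tables, and the function names are hypothetical.

```python
import math


def difference_reliability(r_a, r_b, r_ab):
    """Reliability of the difference between two equally variable scores.

    r_a and r_b are the reliabilities of the two scores; r_ab is their
    correlation with each other.
    """
    return ((r_a + r_b) / 2 - r_ab) / (1 - r_ab)


def standard_error_of_difference(sd, r_a, r_b):
    """Standard error of measurement of the difference between two scores."""
    return sd * math.sqrt(2 - r_a - r_b)


# Illustrative values (assumed for this sketch).
r_verbal, r_performance, r_between = 0.96, 0.93, 0.80

reliability = difference_reliability(r_verbal, r_performance, r_between)
se_diff = standard_error_of_difference(15.0, r_verbal, r_performance)
print(round(reliability, 3), round(se_diff, 1))
```

With these assumed values the difference has a reliability of about .72, close to the figure quoted above, and a standard error of about 5 IQ points. A difference of 15 points is then roughly three standard errors, while a difference of 5 points is no larger than its own error of measurement.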


When a person does much better on performance tests than on verbal tests, we suspect him of having a language handicap. One who learned English late in life, who has had a very limited education, or who suffers from deafness may perform badly on tests of word knowledge and verbal reasoning, [...] because of difficulty in understanding the verbal test and because of [...] ability to reason verbally. Since performance tasks depend very little [...] and the directions use simple language, verbal handicaps reduce the score only slightly. Many adults who might be regarded as defective if judged only by their verbal comprehension are able to perform nonverbal tasks at an average level.

[...] this interpretation is the assumption that people who have developed normal performance ability would do equally well in verbal tasks if they had had normal experience. When one can identify restrictive factors in the person's past history, this interpretation can scarcely be denied. Poor verbal ability is easily understood in the case of the child from a bilingual home, the child who has had difficulty in learning to read, and the one who dropped out of school at an early age. Many others, however, show Verbal-Performance differences where no handicap can be identified. The psychologist is unable to say whether such differences are due to unidentified background factors or to some innate lack of specialized verbal aptitudes.

While verbal handicaps are easily identified, there is rarely such an obvious handicap to explain the cases where Performance IQ is well below Verbal. [...] poor performances are accounted for by emotional blocking. A performance test demands a longer period of steady work and sometimes a [...] of trial-and-error operations; the person who becomes upset will perform erratically. No comparable blocking occurs in short-answer verbal questions where the person’s failures are less obvious to him, although it is sometimes observed in the Arithmetic subtest. A painstaking, cautious performance will lower the Performance Score. Such undue caution is interpreted as having emotional origins.

Sometimes the verbal score is elevated by an artificially cultivated vocabulary. Some parents encourage [...] large vocabularies, and [...] students and adults make a great effort to learn new words. The tester sometimes observes in such subjects a love of big words and an effort to give impressively complicated answers to simple questions. The person who has a one-sided verbal development often does better on recall questions (Information, Vocabulary) than on items demanding independent thought (Comprehension, Similarities).

[...] there is no single interpretation for any pattern of Verbal-Performance differences, such a difference is merely a signal to the tester that further data on the [...] study of the test performance as observed, an inquiry into the person’s background, and usually supplementary tests are required to arrive at a deeper understanding of the difference.


38. Digit Symbol is an exception to the finding that delinquents do better on Performance tests than nondelinquents. There is, in fact, a significant difference in the opposite direction. How can this be explained?

39. Wechsler deliberately included subtests which are susceptible to emotional influences in his measure of "intelligence." In your opinion, does this increase or decrease the usefulness of the test?

40. Bill and John, two 15-year-olds, are referred to the school psychologist because both are failing in ninth-grade work, their courses being social studies, English, general science, and art appreciation. Both have IQs of 93, but Bill has a Verbal IQ of 95 and a Performance IQ of 92, while John has a Verbal IQ of 87 and a Performance IQ of 106. How would the interpretations and suggestions for dealing with the two boys differ?

41. A relatively low Performance IQ suggests emotional disturbance. When Verbal IQ is lower than Performance IQ, can we conclude that the person is well adjusted? (Consider Mark, p. 191, in this connection.)


Interpretation of Subtests and Profiles 


The Wechsler scale is neatly organized into subtests, and many attempts
have been made to develop separate interpretations of the several subtests.
One cannot simply say, "Low Vocabulary means poor ability to deal with
symbols." Vocabulary is affected by the subject's acquaintance with our
verbal culture, his schooling, and his ability to express himself, which may
be impeded by emotion. It differs from Similarities in that Vocabulary can be
passed on the basis of recall, whereas Similarities requires reorganizing
information. The elements that influence each particular subtest can be known
only through great experience with the test and study of the research
literature.

Wechsler finds some association between patterns of subtest scores and
particular types of mental disorder, and has recommended the test for clinical
diagnosis. Many clinicians have developed special formulas for combining
subtest scores into indices supposedly characteristic of brain-damaged
patients, schizophrenics, etc. As an illustration of these clinical
hypotheses we may quote Schafer's description of the pattern found in
psychopathic character disorder (1948, p. 54):


The characteristic pattern is a superiority of the Performance level over the
Verbal, low scores on Comprehension and Similarities and high scores on tests
of visual-motor coordination and speed [Object Assembly, BD, Digit Symbol].
Often the Digit Span score does not drop, reflecting the characteristic
blandness. Frequently Picture Arrangement is conspicuously high. This is
especially true of shrewd "schemers." If Picture Completion is high,
over-alertness or watchfulness is probably characteristic. . . .

Qualitatively the chief feature is usually blazing recklessness in guessing
at answers, . . . "George Bernard Shaw wrote Faust," "Magellan discovered the
North Pole," "Chattel means a place to live (chateau)," "Ballast is a dance
(ballet)," "Proselyte means prostitute," and so forth. . . . The over-all
pattern will indicate that this is a bland, unreflective, action-oriented person
whose judgment is poor, whose conceptual development is weak, but whose grasp of
social situations may yet be quick and accurate.


Some of the proposals for diagnostic interpretation have been little more 
than plausible guesses or generalizations from small, unrepresentative sam- 
ples. Even the suggestions based on sound research have limited practical 
value, primarily because they rest on unreliable difference scores. Only
unusually large differences between subtests (greater than 8 scaled-score units)
should be taken seriously (Wechsler, 1958, p. 164).

There is theoretical justification for expecting brain damage to impede
one type of performance more than another, or for expecting psychopaths to
suffer where pretentious, incautious responses are penalized. The effect of
personality is masked, however, by the influence of general mental ability,
other aptitudes and experience factors, attitudes in taking the test, and
random errors. Many studies agree that on the average schizophrenics have
Verbal IQs higher than Performance. But when we look at Rapaport's data
(1945, Appendix II), we find that only 31 out of his 72 schizophrenics have
Verbal IQs five points or more above the Performance IQ. Even among the
highway patrolmen used as a comparison ("normal") group, 18 out of 54
showed this "sign."
Basing diagnosis on multiple signs reduces errors of classification, but
rarely does a patient show all the signs of his class. At best, one can hope to
find statistical trends which distinguish groups of psychopaths (for example)
from other groups. No objective treatment of the Wechsler scores has proved
able to classify individual patients with a useful degree of accuracy. Indices
representing "scatter" of subtest scores (e.g., the range from highest to
lowest subtest score) are worthless as diagnostic signs (Patterson, 1953).
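The weakness of the Verbal-above-Performance "sign" in Rapaport's data can be made concrete with a little arithmetic. The sketch below is an illustration only. It assumes the counts are 31 of 72 schizophrenics and 18 of 54 patrolmen, and the function name `sign_rates` is hypothetical. The last two figures depend on the relative sizes of the two groups in this particular sample, so they would not carry over to a setting where schizophrenics are rarer.

```python
def sign_rates(pos_patients, n_patients, pos_normals, n_normals):
    """Summarize how well a single yes/no 'sign' separates two groups."""
    # Share of patients who show the sign.
    hit_rate = pos_patients / n_patients
    # Share of the comparison group who also show it.
    false_positive_rate = pos_normals / n_normals
    # Among everyone showing the sign, the share who are patients.
    share_patients_among_positive = pos_patients / (pos_patients + pos_normals)
    # Share of all cases classified correctly if "sign" is read as "patient".
    correct = pos_patients + (n_normals - pos_normals)
    accuracy = correct / (n_patients + n_normals)
    return {
        "hit_rate": hit_rate,
        "false_positive_rate": false_positive_rate,
        "share_patients_among_positive": share_patients_among_positive,
        "accuracy": accuracy,
    }


# Counts assumed from the passage: 31 of 72 schizophrenics, 18 of 54 patrolmen.
rates = sign_rates(31, 72, 18, 54)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts the sign appears in about 43 percent of the schizophrenics and 33 percent of the comparison group, and reading it as a diagnosis would classify only slightly more than half of the combined sample correctly.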

It will be noted that Schafer did not propose to identify psychopaths by a
numerical treatment of subtest scores. He examined the nature of the errors
and successes to arrive at a qualitative picture of the personality. The
Wechsler scale is in some ways superior to other tests or interview procedures
as an aid in forming such impressions, because the questions are the same
for all subjects, are varied, and elicit highly revealing responses. If the
clinician wishes to describe the subject, he should consider the Wechsler
subtests individually and qualitatively (with due awareness that he may be
interpreting random variation). The clinician must not regard an impression
formed in this manner as a diagnosis. The impression is useful, but it is not a
scientific finding. The Wechsler yields a general measure of mental ability
and a verbal-performance difference, and beyond that can offer hints
leading to further study of the individual.

Before going on to a more general discussion of performance tests, we can
summarize briefly the virtues and defects of the Wechsler scale. It is
efficiently designed, interesting to most subjects, and at least as valid for
predictive purposes as the Stanford-Binet. It covers a broader range of mental
tasks and affords exceptionally good opportunities for qualitative observation
of behavior and thought processes. The norms for the test, once a point of
serious criticism, have been greatly improved. As a practical individual test the
Wechsler falls short in only one particular: the scale has insufficient range to
measure very high and very low abilities dependably.

The test is, however, a distillation of clinical experience, and this
contributes both to its strength and to its weakness. It is a useful sample of
complex behavior in which emotional and intellectual factors are entwined. But it
is based on no clear theory of intelligence and makes no serious effort to
separate mental ability from other aspects of adaptation. The tasks generally
derive from techniques invented thirty years or more ago, and there is no adequate
rationale for interpreting the subtest scores. It is reasonable to hope that
some future worker will start from a theory of mental processes,
design tests to measure those particular processes, and so arrive at a truly
diagnostic device. The total score on such a test would almost certainly
correlate substantially with Wechsler's.

42. Many clinicians have tried to select an abbreviated test from the Wechsler
series so as to obtain a quick measure of ability, though one of inferior
reliability (McNemar, 1950). What would you consider in deciding which three
subtests to use? Which three subtests seem best to you for this purpose?

43. Is a high or a low correlation between subtests desirable in a general mental
test?

44. What description of the patient's thought processes is suggested by each of
these responses to "Why should we keep away from bad company?" (Schafer,
1948).
a. Your friends will talk about you; if we want to live in a good environment,
we must choose good company. (IQ = 107)
b. I don't know if that necessarily holds true. To prevent picking up their bad
habits, I guess. (IQ = 123)
c. It's a trend toward living the same kind of a life, get [...] yourself.
(IQ = 127)

45. Match the responses in the preceding question to these answers for "Why do
we have laws?" given by the same set of patients.
a. Govern the behavior of people. [E queries.] There has to be some maintenance
of order by which government policies are carried out [...] personal behavior
of individuals.
b. To have a law-abiding group of people; otherwise they would corrupt the
city.
c. To make good citizens out of us; to keep the unruly under control.

46. Harper (1950), comparing 245 schizophrenics to 237 normals, established
reliable differences between subtests. A formula for combining standard scores
on subtests is offered: .28 Inf - .15 Comp + .17 DSp - .19 Pic Com +
.25 BD - .35 DSym (+ other small terms). A "cutting score" halfway between
the mean for normals and the mean for schizophrenics was used. In a new
sample, 68 percent of schizophrenics fell beyond the cutting score. The formula
is thus shown to be truly discriminating. In view of the large number of
misclassifications, what value does the formula have in practice?

47. According to Harper's formula, schizophrenic profiles tend to have a high
point on Block Design and a low point on Digit Symbol.
a. Can you explain this?
b. Could you give an equally convincing explanation if the opposite had been
found?

48. What advantage would there be in using an "intelligence" test to diagnose
abnormal personalities, over using a "personality" test having similar validity
for that purpose? Would this argument hold if the "personality" test had
definitely higher validity for diagnosing such cases?

WHAT PERFORMANCE TESTS MEASURE 
There is no need to describe performance tests other than the Wechsler in
detail. Some, like the Arthur scale, are collections of tests covering a variety
of performances. Others, like the original Kohs Block Design Test (see p. 41)
or the Porteus mazes (p. 29) are devoted to a single type of item.

Some performance tests are better measures of general ability than others,
either because they are more reliable or because they make a greater
intellectual demand. Simple timed formboards demand manipulative speed
more than thought and have rather low correlation with general ability. In
the WAIS the tests which correlate highest with the Performance IQ are
Block Design and Picture Completion. Digit Symbol actually correlates
higher with the Verbal IQ than with the Performance IQ (Wechsler, 1958,
p. 255).


Cultural Influences

Frequently it has been claimed that performance tests are "culture free."
A "culture-free" test is one on which scores are completely uninfluenced by
experience in a particular environment. Such a test would give a fair
comparison of mental abilities in different countries and across different
social classes.

Educational handicaps show up more in a verbal test than in a performance
test. This is illustrated by a study of canal-boat children, who live a nomadic
life and have an impoverished, unintellectual environment. Binet tests
correlated [...] with educational level, but a performance test correlated
only .26. The Performance IQs were about 10 points higher (Gaw, 1925).

The skills involved in performance tests are developed through experience,
and every culture provides some amount of training along the lines tested.
Egyptian psychologists examined mental development in a group
living on the edge of the desert (Fahmy, 1954). They found that scores on
most performance tests were quite a bit below the European norms of
children of the same age. On a test, however, which called for assembling
colored mosaics (similar to Block Design) these children averaged
above the European norms. Color plays a large part in
this culture and in the children's games. This evidently helps their test
performance by giving them experience in examining patterns or by developing
their interest in such tasks. (See also Havighurst et al., 1946.)

Training in particular types of discrimination and reasoning seemingly
influences only a few performance measures. The subtle effects on attitude and
motivation are likely to affect all tests. The educated classes in America,
Europe, and in nations influenced by Western civilization are taught from
early childhood to take intellectual matters seriously. The child is praised
for answering adults' seemingly pointless questions. He shares
word games with his playmates, and these experiences also cause him to take
artificial problems seriously. These activities teach an attitude of
self-criticalness and competitiveness.

There have been many attempts to compare the mental abilities of various
nations and racial groups by means of performance tests or translated verbal
tests. Differences in average performance are found in most studies, but the
differences are fairly small. In every group tested a large number score
better than the average of the white sample used in standardizing the tests.
This is evidence in itself that no one racial group has a monopoly on talent.
Precise comparisons of group averages have no practical importance, but
they might be of great scientific importance if tests were equally fair to all
groups. It is now generally agreed that no universal test for measuring
mental ability can be developed. Any test calls for habits and attitudes that
some cultures favor and other cultures inhibit. The test shows how the
persons tested have developed along those lines, not how they rank on all tasks
or how bright they are innately.²


49. In what ways, if any, might cultural differences affect performance on each of
the following tests?
a. Formboards (fitting blocks into variously shaped holes)
b. Wechsler Picture Arrangement
c. Porteus mazes

² Racial comparisons have frequently been misinterpreted because some writers
attempt to prove that there are no innate differences in ability, and certain
others to prove that nonwhite groups will not profit from improved educational
opportunities. Balanced accounts of the many studies and of their possible
interpretations are given by L. Tyler, 1956, pp. 276-309, and Anastasi, 1958,
pp. 542-575.




Emotional Influences 


Performance tests generally demand a longer period of sustained attention
than the shorter items of individual verbal tests. This provides a greater
opportunity for confusion or frustration to build up, and as a result the
performance test is more likely to reflect emotional disturbance. Porteus
(1950) discusses two studies of boys and girls in a reformatory, where
maladjusted delinquents were compared with law-abiding, well-behaved inmates.
In both studies, the adjusted and maladjusted groups had similar Binet IQs,
but on the Porteus maze the maladjusted group dropped about 10 points below
the others. Another study found that group psychotherapy raises the Porteus
MA of schizophrenic patients by two years (H. N. Peters and F. D. Jones,
1951). This suggests that the psychotherapy releases ability previously
suppressed by emotional conflicts.


Practical Correlates 

The fact that performance tests are relatively independent of educational
background raises their validity for some purposes and lowers it for others.
When a tester is trying to predict subsequent educational achievement, the
verbal test is likely to be more informative. Whatever handicaps depress the
verbal score will also interfere with future attainment in most schooling, as
was noted in E. L. Thorndike's comment (p. 189). He went on to voice the
common expectation that a less verbal test would be a better predictor of
practical adjustment. There is some evidence to support this view. In one
study, the adjustment of borderline mental defectives in the community
correlated .77 with their Porteus maze scores, but only .57 with Binet scores
(Porteus, 1939). An earlier study used ratings of the efficiency of children in
a school for the mentally retarded as criteria. There were separate ratings on
"educational efficiency" and "industrial efficiency" (i.e., performance in
occupational training). The Binet predicted the former much better than the
maze (the respective correlations being .81 and .59 for girls). But the Binet
predicted the trade-performance criterion less well (.66 vs. .75) (Berry and
Porteus, 1920). One might expect the Wechsler Verbal score to correlate
higher with school success than the Performance score. The correlations
reported, however, show nearly equal validity for the two scores ([...] and
[...], respectively; Frandsen and [...]; Mussen et al., 1952). Further
investigations are needed to explain this finding.
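
The Porteus and Binet correlations quoted in this section are easier to compare when squared, since the square of a correlation is the proportion of criterion variance linearly associated with the predictor. The following is a minimal sketch using only the figures given in the text; the dictionary layout and the name `shared_variance` are illustrative assumptions, and the comparison says nothing about whether the differences are statistically dependable, because the sample sizes are not reported here.

```python
# Correlations quoted in the text, keyed by criterion and predictor.
reported = {
    "community adjustment (Porteus, 1939)": {"maze": 0.77, "Binet": 0.57},
    "educational efficiency (Berry and Porteus, 1920)": {"maze": 0.59, "Binet": 0.81},
    "industrial efficiency (Berry and Porteus, 1920)": {"maze": 0.75, "Binet": 0.66},
}


def shared_variance(r):
    """Proportion of criterion variance linearly associated with the predictor."""
    return r * r


def compare(table):
    """Return, per criterion, the squared correlation for each predictor."""
    return {
        criterion: {test: round(shared_variance(r), 3) for test, r in rs.items()}
        for criterion, rs in table.items()
    }


for criterion, result in compare(reported).items():
    print(criterion, result)
```

On this reckoning the maze accounts for roughly 59 percent of the variance in community adjustment against about 32 percent for the Binet, while the ordering reverses for the educational criterion.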

The performance test has special importance in the clinic. Performance
tests generally depend more on ability to attack a new problem, and less on
habit and [...]. They are quicker to reflect the adverse effects of emotional
disturbance or brain damage, and perhaps quicker to reveal the effects
of therapeutic treatment. In cross-sectional studies of aging, for example,
nonlanguage tests begin to decline in the 20's, whereas verbal ability is
almost constant until the mid-40's. When radical brain surgery is performed
in an attempt to aid a psychotic, we might expect his intellectual
performance to be affected. His MA on the Binet or Wechsler Full scale or on a
group test will not be changed in any predictable way, the average change
being negligible. His MA on the Porteus mazes will show an initial loss
of about 2 years, though it will ultimately recover and rise far above the
original level (Mettler, 1949). This change is consistent with the clinical
picture. The patient's personality shifts from a depressed, worried state to a
[...] adaptable state in which he gives little thought to the future. He then
gradually stabilizes his behavior in a socially constructive pattern. The
loss of planning and foresight is observed in his maze performance, where
it of course leads to error. A similar decline appears in Object Assembly and
Digit Symbol. No impairment is found in the Verbal tests or in Block Design,
Picture Completion, and Digit Span.

Since performance tests involve spatial and perceptual abilities which predict
success in certain types of jobs (see Chapter 10), they might have some
significance for vocational guidance. As a factor analysis of the Wechsler to
be reported later shows (p. 264), however, its subscores do not reveal these
separate abilities clearly. Other general performance tests are even less
satisfactory as measures of special ability. Tests which provide purer measures
of specialized aptitudes will generally give better information for
vocational choice.

50. If it were practical to use an individual test for selecting Army officers,
would a verbal or a performance test be preferable?

51. Comment on this statement: "A person's true level of mental ability is shown by
whichever IQ, verbal or performance, is higher."

52. Leona Tyler (1956, p. 10) makes this statement about performance and
nonverbal tests: "If they are worth less to us than we expected as substitutes
for the typical verbal intelligence test, they are worth more as supplements."
What evidence justifies this statement? What implications does it have for
planning a testing program?


NOTEWORTHY INDIVIDUAL TESTS

The Wechsler scale, combining as it does a good performance measure and
a good verbal measure, has almost entirely replaced earlier performance
batteries. Among general-purpose predictors, the Wechsler and the
Stanford-Binet are equally prominent, with no other serious competitor. Our survey
of important individual tests, then, would have few entries if the criterion for
admission were wide use at the present time. Attention needs to be drawn,
however, to some little-used tests of good quality, and to tests mentioned in
the basic research studies of earlier days. Revisions of some of these tests are
being made, and they may again become prominent.

Each listing below gives the title of the test, the authors, the publisher,
and the date of major editions (including always the earliest and latest);
ages or grades for which suited; remarks about nature, purpose, and quality.
In preparing these statements, the writer has relied heavily but not
exclusively on comments made by reviewers for the Buros yearbooks.

• Columbia Mental Maturity Scale; B. Burgemeister, L. H. Blum, Irving
Lorge; World Book, 1953. Ages 3 to 12. Each item consists of three or more
drawings printed on a large card. The child points to the one which does not
belong with the others. Well suited to testing physically handicapped
children. Though the test is brief, reliabilities near .90 are reported. Correlates
about .75 with the Stanford-Binet.

• Draw-a-Man Test; Florence Goodenough; World Book, 1926. Ages 1
to 10. The child is asked to draw the best man he can. Scoring takes into
account the basic structure of the drawing (e.g., are the arms attached to the
trunk?) and details of features, clothing, etc. This is a simple test to
administer, and scoring rules were carefully prepared. Though the Draw-a-Man can
be applied in all cultures, it is dependent on cultural influences. Some
comparable tests (e.g., House-Tree-Person) are used as a technique for
examining personality, rather than as a measure of intellectual development alone.


• Leiter International Performance Scale; Russell G. Leiter; C. H.
Stoelting, 1936, 1948. Ages 2 to 18. The tasks require perceptual matching,
analogies, memory, and other varied items, many of them similar to verbal tests.
The test is given with very simple directions (spoken or pantomime), and
the items themselves require no language. The test has many excellent
features, being especially suited to handicapped children. The IQ conversions
are of questionable accuracy at preschool levels.

• Merrill-Palmer Scale; Rachel Stutsman; Stoelting, 1931. Ages 2 to 5. A
scale for preschool children using interesting games, puzzles, pictures, etc.
Language questions are simple tests of comprehension ("What cries?").
Some tests involve dexterity (cutting with scissors). Speed is heavily
emphasized. The technical quality and content of the 1931 version compares
unfavorably with the Stanford-Binet.

• Minnesota Preschool Scale; Florence Goodenough, Kathryn Maurer,
M. J. van Wagenen; Educational Test Bureau, 1932, 1940. Ages 1½ to 6.
Verbal comprehension and memory tests are used in a verbal score. A nonverbal
scale includes form recognition, tracing, picture completion, block building,
and simple puzzles. Some long-term follow-up studies of predictive validity
have been made. This is an accurate test for ages 3 to 5, but not one with
great appeal for the child.

• Pintner-Paterson Scale of Performance Tests; Rudolf Pintner and D. G.
Paterson; Psychological Corporation, 1927. Ages 4 to 16. This was the first
substantial performance battery. It included object assembly, formboards,
and Healy Picture Completion (pictures to be completed by inserting
blocks). It played a major part in research and clinical work prior to the
development of the Wechsler scale. Scores depend heavily on speed, and
reliability is unsatisfactory.

• Point Scale of Performance Tests; Grace A. Arthur; Stoelting,
Psychological Corporation, 1925, 1947. Ages 4½ years to adult. One of the best
collections of performance tests standardized on the same sample. Includes
formboards, maze, block design, etc. MAs tend to be lower than SB MAs,
owing to defects in the standardization.

• Stanford-Binet Scale; L. M. Terman and Maud A. Merrill; Houghton
Mifflin, 1916, 1937, 1960. Ages 2½ years to adult. (See pp. 168 ff.)

• Valentine Intelligence Tests for Children; C. W. Valentine; Methuen,
1945, 1953. Ages 1½ to 15. A British scale combining items from well-known
sources (Gesell, Burt-Binet, Stanford-Binet, Merrill-Palmer, etc.). It is
regarded as a superior test for preschool ages, though its standardization is
inadequate.

• Wechsler Intelligence Scales; David Wechsler; Psychological
Corporation, 1940, 1955. WISC, 5-15 years; WAIS, 16 years to adult. (See pp. [...].)

TESTS OF INFANT DEVELOPMENT

Tests such as we have discussed to this point set a task for the child and
observe how well he can perform it. Such a method cannot be applied to the
infant, who does not comprehend instructions and has not learned to do
things on command. Tests of early development consist primarily of
observations of the child's response to stimulation. They indicate whether the
child is showing the developments normal for his age, rather than his
mental level specifically. We cannot test a 1-year-old on abstract [...]
problem [...]. Investigators have concentrated on those aspects of behavior
which can be identified objectively in the young child. Bayley, in a study
(1933), used a composite of 185 items from existing scales, of which the
following are representative. The number in parentheses is the age in months
at which the development is normally found:


(0.6) Lateral head movements, prone.
(1.4) Vertical eye coordination.
(3.0) Reaches for ring.
(3.6) Manipulates table edge.
(5.5) Discriminates strangers.
(5.9) Vocalizes pleasure.
(6.6) Lifts cup by handle.
(8.6) Says da-da or equivalent.
(9.3) Fine prehension.
(9.9) Rings bell purposefully.
(13.5) Makes tower of two cubes.
(16.6) Turns pages.
(20.1) Square or triangle in Gesell formboard, reversed.
(21.5) Names three objects.
(25.0) Understands two prepositions.
(28.4) Picture completion.
(34.6) Copies circle; one success on three trials required.
(35.6) Remembers one of four pictures.


The heavy emphasis on sensorimotor development in the infant tests
makes it impossible to interpret them as measures of mental ability. As
Boynton comments (Monroe, 1941, p. 629), "When the Linfert-Hierholzer Scale
attempts to measure intelligence in terms of the child's ability to follow
visually a ball or to use a spoon in eating, or when Charlotte Bühler looks for
intelligence in a child's smile or in the fact that he seeks a lost toy, it is
apparent that the procedure involves matters which neither the layman nor the
psychologist would regard as integral aspects of intelligence at a later age."

One way of meeting this objection is to regard the data as meaningful in
their own right, as showing what the infant is doing. Information about the
normal development of coördination, for example, may be important for the
pediatrician who must recognize and diagnose disease, dietary deficiency, or
abnormality. Data about development of sensorimotor behavior may be
important for psychological theory also.

Most investigators, however, have wanted to forecast mental development.
The psychologist dealing with placement of children for adoption,
for example, desires a good early measure of mental ability. A good mental
measure early in infancy might also be of value in identifying certain types
of mental defect which can be overcome by early application of appropriate
drugs. For such applications, the validity of the infant test as a mental test
must be examined.

The correlations in Table 21 (p. 176) show that tests in the first two years,
where the items are predominantly sensorimotor in type, have negligible
correlations with tests at school age. A test at age 2 or 3, however, has fair
ability to forecast school-age intelligence.

The rise in correlation is to some extent a reflection of increase in
reliability with age. Tests for very young children are apt to be unreliable, in the
first place, because the child's attentiveness and alertness fluctuate. Even if
enough items are used to obtain a good estimate of his status in a given week,
his standard score shifts markedly from month to month. All infants show
unexplained spurts of development. The child may forge ahead rapidly in
locomotor ability and then remain at the same level with no further change for
weeks. Moreover, he may make progress in only one area at a time,
increasing his vocabulary while his coördination shows no further advance or vice
versa.

Accuracy of measurement can be increased by using more items, or by
combining estimates made in successive months. J. E. Anderson (1939, p.
376) suggests other precautions to increase the dependability of the
measures:

The earlier . . . the measurements are made, the less reliance can be
placed on a single measurement or observation, if that measurement or
observation is used for predicting subsequent development.

The earlier . . . the measurements are made, the greater care should
be taken to secure accuracy of observation and record and to follow
standardized procedures.

The earlier . . . the measurements are made, the more account
should be taken of the possibility of disturbing factors, such as negativism
and refusals, that operate as constant errors to reduce score.

Since development is a timed series of relations or sequences, there
are for many functions periods below which only a small portion of the
function can be measured and above which a progressively larger portion
can be measured. Hence, the possibilities of prediction are limited
and progression with age is not an infallible indicator of the [...]
measurement. Every effort should be expended to secure the most accurate
and predictive tests by standardizing tests against multiple [criteria,
particularly measures of ability in later life] rather than against
single criteria.

This last point is at the heart of the difficulty. If the functions that
constitute intelligence cannot be observed early in the child's life, substituting a
measure of nonintellectual functions is no solution. We must wait until
purposeful problem solving is present; more than that, we must wait until these
types of behavior are reasonably well stabilized, since measures taken while
a type of behavior is just emerging are notoriously unreliable. This
conclusion is supported by the results of Maurer, cited earlier, that observations of
undirected behavior are not predictive of later IQ, but suitably chosen task
performance is. Since tasks cannot be set for the child much below the age of
2, there is little hope of predicting the IQ from tests in infancy.
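
The earlier remark that accuracy can be increased by using more items, or by combining estimates from successive months, can be illustrated with the Spearman-Brown relation between test length and reliability. This is a minimal sketch, assuming the added items or sessions are parallel to the original ones, an assumption that unstable infant behavior may well violate. The reliability values used are hypothetical and are not taken from the text, and the function names are illustrative.

```python
def spearman_brown(r, k):
    """Reliability of a measure lengthened k-fold, given original reliability r.

    Assumes the added parts are parallel to the original (same true score,
    equal error variance).
    """
    if not 0 < r < 1:
        raise ValueError("r must lie strictly between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return k * r / (1 + (k - 1) * r)


def length_needed(r, target):
    """Lengthening factor required to raise reliability from r to target."""
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("r and target must lie strictly between 0 and 1")
    return target * (1 - r) / (r * (1 - target))


# Hypothetical case: a single infant session with reliability .50.
single = 0.50
print(spearman_brown(single, 2))    # two comparable sessions combined
print(length_needed(single, 0.80))  # sessions needed to reach .80
```

The formula speaks only to random error within the period measured. It does not address the month-to-month shifts in true standing described above, which is why longer tests alone cannot make infant scores predictive.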


A much more optimistic but still cautious position on infant testing is taken
by Escalona (1950). She regards the test as an opportunity to observe the
functioning of the whole organism in "a situation which has structural and
dynamic properties"; i.e., she turns away from an attempt at exact
measurement of a single attribute and uses the test to enrich an impressionistic
observation. Her position is influenced both by the field theory of Kurt Lewin
and by the psychoanalytic approach to test interpretation represented by
Rapaport and Schafer. She considers the child's total social response,
management of his body, and attention pattern in Gesell's tests, and tries to judge
his development qualitatively. One case she describes as follows:


An infant was first seen at the age of three months. At that time he gave every
evidence of making unusually good developmental progress, earning test scores
which placed him in the accelerated range. He was characterized by a very high
activity level, bodily activity increased markedly in response to all
stimulation. His capacity to tolerate delay or frustration seemed lower than
that of most infants of the same age. At the three months age level, test
performance reflects primarily gross motor coordination, vigorous
responsiveness to stimulation and perceptual discrimination. At a later age,
however, tests are designed so as to also elicit fine motor coordination which
requires inhibition of impulse, as well as problem solving behavior which
implies delay in attaining a goal.

A prediction was made that the child tended toward immediate discharge of
tension, would probably find tasks calling for inhibition of impulses
frustrating, and would be likely to earn only an average IQ on the later tests.

The child was retested at 9 months and at 22 months. He was again noted to be
a more than ordinarily active child. His total IQ dropped from the superior to
the average range. Items requiring fine motor coordination and those requiring
a . . . were passed at a low level or were refused altogether. Items requiring
immediate grasp of a problem, however, were performed at average and superior
levels. Gross motor coordination remained outstandingly good.
In many instances the tester can judge whether a child has performed at his
best. Escalona divides children into two categories in this respect. Those
whose tests are judged "optimum" change their standing very little on tests a
year or so later. No prediction can be made for those whose tests are judged
nonoptimal. There are many large changes in this group.

Here we have a characteristic contrast between the psychometric and the
impressionistic approach to testing. The psychometric criteria applied by
Bayley indicate almost no predictive value in infant tests. Escalona's
clinical method is said to give not only a statement of developmental level
which is predictive in . . . cases but also a qualitative . . . weak points.
There is no reason to question the correctness of either viewpoint. It is
obvious, however, that not just any impressionistic interpretation can be
depended upon. Until Escalona communicates the observational cues and the
theory she uses to interpret tests, and until the various interpretations
have been checked by systematic research, her method cannot be used by
others.

53. Characterize the abilities used within each year level of Bayley's scale (assum- 
ing that the items listed are characteristic). 

54. How do the abilities tested in her scale differ from those tested in the Stanford- 
Binet? 

55. To what extent would differences in experience give some children an
advantage on the tasks listed?


Important Infant Tests

• California First-Year Mental Scale; Nancy Bayley; University of California
Press, 1933. Ages 1 to 18 months. A set of items chosen from other scales
and standardized by retesting the same group of about fifty infants
repeatedly. More data are available on this scale, taken as a whole, than on
other infant tests.

• Cattell Infant Intelligence Scale; Psyche Cattell; Psychological
Corporation, 1947. Ages 2 to 30 months. This is an attempt to extend the
Stanford-Binet downward. Items are somewhat more complex than those in . . .
schedules, but at the youngest ages simple perceptual responses (e.g.,
looking at a moving person) are counted. The test at age 1 correlates
.56 with SB IQ at age 3 but has very low correlations with school-age IQ
(Cavanaugh et al., 1957). It has no predictive value before the first
birthday.

• Gesell Developmental Schedules; Arnold Gesell and others; Psychological
Corporation, 1925, 1949. Ages 4 weeks to 6 years. A schedule of behavior
divided into four areas: motor, adaptive, language, and personal-social. The
child is stimulated, e.g., by placing a block in front of him, and his
reactions are compared with expectations for his age. The standardization,
. . . and interpretation of scores are open to question. As Anastasi says
(p. 283), "These schedules may be regarded as a refinement and elaboration
of the qualitative observations routinely made by pediatricians."

• Griffiths Mental Development Scale; Ruth Griffiths; University of London
Press, 1954. Ages 0-2 years. Five carefully prepared scales, using original
items together with those of Gesell and others, measure locomotor,
personal-social, hearing-speech, hand-eye, and performance developments. The
total scale includes 260 items and permits more reliable measurement than any
other instrument. Retest reliability with more than a six-month interval is
.87 (Griffiths, 1954). Too little research is available to evaluate the scale
at present.

Suggested Readings

Anastasi, Anne. Race differences: methodological problems. Differential
psychology. (3rd ed.) New York: Macmillan, 1958. Pp. 542-575.
An authoritative discussion of the proper interpretation of studies comparing
test scores of racial groups includes some representative findings. The
subsequent chapters review several major investigations, particularly of
differences between Negroes and whites.

Brown, Elinor W. Observing behavior during the intelligence test. In Eugene
Lerner & Lois B. Murphy (eds.), Methods for the study of personality in young
children. Monogr., Soc. Res. in Child Developm., 1941, 6, (4), 268-283.
Responses of two 4-year-olds to the Stanford-Binet are presented to show that
performance depends on personality and response to the examiner, as well as
on intellect. Students should read this protocol if they have not seen a
demonstration of individual mental testing.

Richards, T. W. Mental test performance as a reflection of the child's current
life situation: a methodological study. Child Developm., 1951, 22, 221-238.
(Reprinted in Eugene L. Hartley & Ruth E. Hartley, eds., Outside readings in
psychology, 2nd ed. New York: Crowell, 1958. Pp. 260-273.)
A child's Binet performance from age 8 to age 10 fluctuated from IQ 115 to
IQ 140. Richards traces observation records, parent attitudes, and personality
tests to show a correspondence between test changes and changes in the
pressures and satisfactions in the child's life.

Schofield, William. Critique of scatter and profile analysis of psychometric
data. J. clin. Psychol., 1952, 8, 16-22.
Schofield reviews the studies claiming to find information in Wechsler profile
shape that can be used for clinical diagnosis. Wishful thinking, accompanied
by inadequate research design, is blamed for the widespread and unjustified
faith in profile interpretation. The faults in this research should be noted in
planning any validation study.

Terman, Lewis M. The discovery and encouragement of exceptional talent. Amer.
Psychologist, 1954, 9, 221-230. (Reprinted in Don E. Dulany, Jr., & others,
eds., Contributions to modern psychology. New York: Oxford University Press,
1958. Pp. 51-65. Also in H. H. Remmers & others, eds., Growth, teaching, and
learning. New York: Harper, 1957. Pp. 63-77.)
This lecture surveys some of the principal American work with mental tests,
including Terman's follow-up of exceptional children. Terman reviews the
childhood differences between those who succeeded in later life and those
whose careers were mediocre, emphasizing the cultural factors that bring
talent to fruition.


Group Tests of General Ability


GROUP tests are used far more extensively than individual tests because of
their economy and practicality. Particularly in dealing with masses of
subjects, whether in the Army, industry, schools, or research, group tests
are indispensable. The better group tests are as reliable as comparable
individual tests, and for many objectives they have equally good predictive
validity. Moreover, they do not require specially trained testers.

The group test is based on the assumption that subjects understand the
nature and purpose of testing, and that each wants to do his best. Wherever
these ideal conditions are not met, the scores of some individuals will be
invalid. The individual test gives the examiner a good chance to note that
the subject is ill, unduly tense, or confused by the directions, and thus the
experienced examiner can recognize when the score is invalid. In the group
test, a standard procedure is applied, and no special consideration is
given to individuals. The group test is the practical solution to the problem
of obtaining information when large numbers of individuals must be
considered at once, for example in classifying recruits or identifying pupils
who cannot keep up with the normal pace. Wherever the important objective is
to make decisions which are correct on the average, the group test is
suitable. Wherever the primary consideration is thorough understanding of
the individual, the flexibility and intimacy of the individual test make it
much more satisfactory. In schools, group tests are often used as a
preliminary device to identify pupils to be studied individually.


REPRESENTATIVE INSTRUMENTS 


Most of the early group tests were based on the "omnibus" or hodgepodge
principle of the Binet scale. The test mixed a great variety of problems so
that specialized abilities called for by certain questions (e.g., arithmetic)
had very little influence compared to the general ability required by the
problems. As the makers of the famous Army Alpha Examination put it, the
ideal was to find tests all related to the criterion and having very little
relation to each other. The omnibus test with items in haphazard order or with
many short subtests was the most common type of group test until the 1940's.
Such a test was used to obtain just one score, a measure of general ability.
Many recent tests are designed so that sections can be scored and
interpreted separately.

Instead of using the omnibus test where specific abilities tend to cancel
each other, some British workers limited their group tests to items thought
to . . . Charles Spearman was the leading spirit . . . general ability by
finding items which . . . else. In the course of this work he . . . plays a
large part in current test development (see Chapter 9). We need not at this
point elaborate on Spearman's method beyond saying that, in effect, he looked
for items which correlate with all other types of mental-test items. The best
measures of g, or general ability, according to Spearman's research, were
abstract reasoning problems. Like Binet before him, Spearman studied his
items in an attempt to formulate a definition of what his test measured. He
concluded that g consists of facility in "apprehension of one's own
experience, the eduction of relations, and the eduction of correlates," i.e.,
in making observations and extracting general principles.



A Homogeneous Test: Matrices


The matrix is . . . most popular technique for measuring mental ability . . .
than in the United States. It was developed by L. S. Penrose and J. C. Raven
in England and published as Raven's Progressive Matrices in 1938. Raven,
following Spearman's . . . relationships. The . . . according to one
principle, . . . The subject must identify these principles and apply them to
determine the needed design.

The matrix form is highly flexible. The possible range of difficulty is
enormous, as can be seen in the examples given. The test may be administered
individually or in groups and may be speeded or given with liberal time
allowance. For less mature subjects, the items can be presented as a series
of formboards where the subject actually chooses a block and fits it into the
blank space. The directions are very simple, so that verbal understanding
. . . Indeed, with very easy initial items, the test can be administered
. . . so that the verbal element is entirely eliminated.

No one form of the matrix test is used widely. Since items are rather easy
to prepare, psychologists in all parts of the world have developed their own.
These versions can be used for hiring employees or selecting students for
special courses, with little fear that the items will become known in
advance. The disadvantage of this variation is the lack of data on any one
form. The published version of the Progressive Matrices is available in the
United States, but its . . . the group test are based on 1407 children (in
half-year age groups . . .), . . . militiamen and 2192 civilians (in
five-year groups). These are . . . and no further description is given. For
general clinical and educational testing, inability to compare a given case
to acceptable American norms is a serious drawback. For selection within a
group of job applicants, of course, norms are of very little importance.

FIG. 35. Matrix items at three levels of difficulty. The first and second
items are like those in the Progressive Matrices, being of the usual
difficulty. The third item is a very difficult free-response form, designed
for testing college graduates.

. . . tests. For children, the correlation with SB is about .60 (Keir, 1949).
For an adult sample, Wechsler Performance correlates .70 and Wechsler Verbal
.58 with Matrices (J. Hall, 1957; see also Martin and Weichers, 1954). The
subtest having the highest correlation with Matrices is Block Design.
Obviously, the matrix items are relatively independent of the educational
attainments which affect the Binet and the Verbal score, though Ombredane
et al. (1956), in studying underdeveloped African tribes, found that test
scores were affected by level of education. Raven suggests that in practical
testing the matrix measure of ability to solve new problems be supplemented
by a measure of past attainment, such as vocabulary. This is especially
important for persons whose efficiency is impaired by old age, emotional
disturbance, etc.

The Raven Matrices were adopted as the principal test for military
classification in Great Britain during World War II. This nonverbal test was
chosen to make sure that normally intelligent recruits were not rejected
because of poor education. The fact that the matrix is so nearly a pure test
of one ability limited its military usefulness. Tests combining general,
verbal, and numerical abilities proved to be better predictors of performance
in training courses. Specialized spatial-mechanical tests such as the Bennett
generally made a better contribution to prediction of success in mechanical
jobs. The matrix test was most helpful in predicting performance in visual
signalling and radar operating (Vernon and Parry, 1949, pp. 235, 244).
Experience such as this has led the practical tester to give more attention
to the specialized abilities than formerly. A relatively pure measure of g is
not usually as good a predictor as a composite of g with verbal, spatial, or
the other abilities required by the course or job to which the person is
assigned. One of the reasons Binet's test was highly successful was that it
called for the specialized abilities in about the same combination that
schooling itself did. It is therefore a better predictor of school adjustment
than a purer test might be.

The purely nonverbal score, however, has one special function in school
testing. It calls attention to pupils who have good reasoning ability but who
are below standard in reading and verbal development. Such cases are obscured
by a test that mixes verbal and nonverbal components together, and thus the
school overlooks children who could do much better work if given suitable
help. The nonverbal test is also useful in employee selection where range of
educational background is wide. Among African tribesmen trained to operate
heavy mining machinery, the matrix test predicted performance ratings with
validity .51. This coefficient was based on performance after two practice
tests, which proved more valid than measurement without practice (Ombredane
et al., 1956).


An Omnibus Test: Kuhlmann-Anderson 


American group tests have been influenced little by theories about
intelligence. Instead, they have been developed pragmatically, by trying
items and retaining those which correlate with such criteria as school
success or job success.

One of the important American group tests is the Kuhlmann-Anderson
Intelligence Test series. Most group tests are printed in a single booklet,
with a different booklet for each three-grade range from kindergarten to
adulthood. The Kuhlmann-Anderson, however, has nine different booklets, so
that a class can be given a version of the test closely fitted to its
ability. The pupils who do well on this booklet may then be given the next
higher test, and those who do badly can be given the next easier test to
obtain a more accurate measurement. Under this plan few pupils encounter
items where they have to guess, and the test is shorter because unnecessary
easy items are eliminated.

The development of this scale is characteristic of the procedures used in
the older group tests of general ability. Beginning in 1916, Kuhlmann began
a tryout of items for use in state institutions in Minnesota, and in 1919
began (with Dr. Rose G. Anderson) to prepare formal tests. In the next four
years more than 100 varieties of items were tried out; 51 seemed promising
enough for further use. Four more years of research led to a selection of 35
subtests for the published scale. The scale then passed through five further
editions, sometimes with minor changes of content to replace unsatisfactory
items or to extend the range, sometimes with modification of norms or format.
Authors of present-day tests often employ a similar procedure but use the
experience of previous investigators to shorten the research.

The scale now contains 39 tests, organized in booklets which partly overlap
each other. Thus booklet K (kindergarten) includes tests 1-10; A (Grade 1),
tests 4-18; B (Grade 2), tests 8-17, and so on up to booklet G for Grades
7-8 and booklet H for Grades 9-12. Figure 36 shows representative test items.
Many of these item types are used in other group tests. Each test consists of
at least eight items. A time limit is set either for each item or for the
test. These limits are liberal in the first two booklets, but later levels
introduce a substantial degree of speeding.

Nearly all of the subtests require adaptation to new situations, yet they
also depend on experience and many of them involve special abilities (verbal,
spatial, etc.). Kuhlmann and Anderson followed Binet's principle of
combining such a great variety of tests that no one specialized ability plays
a large part in the score. Verbal ability is important because the pupil must
comprehend directions, but the test designers use simple vocabulary,
introduce reading only in the later tests, and even at advanced levels use
only short and familiar words. An example of ingenious testing is test 8,
Counting, which measures judgment and accuracy with numbers without using the
formal number system or other school learning.



FIG. 36. Representative Kuhlmann-Anderson items. The directions used in
testing are simpler and more complete than these abbreviated quotations
suggest. (Copyright . . . Personnel Press, Inc. Reproduced by permission.)

Test 2. Picture errors. "Put a dot on the part that is wrong."
Test 5. Pattern Completion. "Put in the stick that is left out of the second
figure."
Test 8. Counting. "Put as many dots in the box as there are balls."
Test 10. Copying. (The square with lines is held up before the class for 10
seconds.)
Test 21. Classification. "Find the one that does not belong with the others."
Test 23. Opposites. "Find the two opposites." (old, rich, wide, poor, green,
full)
Test 26. Similarities. "Find the three things which are alike." (top, rattle,
doll, sled, playing; robin, winter, horse, song, squirrel, fence)
Test 28. Anagrams. "You are given the first letter. Write the rest of the
word." (N-B-U-M-E-R)
Test 32. Arrangement. "If these were arranged in order, which would be the
middle one?" (inaudible, distinct, deafening, faint, loud)
Test 34. Directions. "If the word contains E but not R nor I write 3 after
it." (Basket, Picture)
Test 39. Number series. "Write the two numbers which should come next."
(5 6 8 11 15)

Subtests which have substantial correlation with age were selected in
preference to subtests with low correlations. A second criterion was the
correlation of the subtests with each other; those having low correlations
with other subtests were preferred, so as to increase the comprehensiveness
of the test, bringing in many aspects of ability. This, of course, is just
the opposite of the procedure used in constructing a homogeneous test of a
narrowly defined ability. The Kuhlmann-Anderson measures substantially the
same ability as the Stanford-Binet. When errors of measurement are reduced by
averaging three trials of each test, SB IQ and Kuhlmann-Anderson IQ correlate
almost perfectly (Dearborn and Rothney, 1941).
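
The two selection criteria described for the Kuhlmann-Anderson subtests can be sketched as a simple screening rule. This is only an illustration under stated assumptions: the subtest names are borrowed from Figure 36, but every correlation below, the two cutoffs, and the function name `select_subtests` are hypothetical and do not come from the test's authors.

```python
# Hypothetical data: correlation of each subtest with age, and the
# correlations among subtests. None of these numbers come from the text.
age_r = {"counting": 0.62, "opposites": 0.58, "anagrams": 0.55, "copying": 0.30}
inter_r = {
    frozenset(("counting", "opposites")): 0.35,
    frozenset(("counting", "anagrams")): 0.40,
    frozenset(("counting", "copying")): 0.20,
    frozenset(("opposites", "anagrams")): 0.80,
    frozenset(("opposites", "copying")): 0.25,
    frozenset(("anagrams", "copying")): 0.15,
}


def select_subtests(age_r, inter_r, min_age_r=0.5, max_inter_r=0.6):
    """Keep subtests that track age and overlap little with those already kept."""
    chosen = []
    # Consider the subtests most strongly related to age first.
    for name in sorted(age_r, key=age_r.get, reverse=True):
        if age_r[name] < min_age_r:
            continue  # first criterion: substantial correlation with age
        # Second criterion: low correlation with every subtest already kept.
        if all(inter_r[frozenset((name, kept))] <= max_inter_r for kept in chosen):
            chosen.append(name)
    return chosen


selected = select_subtests(age_r, inter_r)
print(selected)
```

With these made-up figures, "anagrams" is dropped because it overlaps heavily with "opposites", and "copying" is dropped because it barely changes with age. A homogeneous test would apply the opposite rule and keep the items that correlate most highly with one another.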


1. What abilities does the Kuhlmann-Anderson test require that the Raven
Matrices do not?

2. The Kuhlmann-Anderson norms are based on "15,000 cases from . . . five
Minnesota, New York, New Jersey, and Pennsylvania communities . . . selected
in consultation with State Departments of Education." What further
information about the norm group would be desirable?


3. In the normative sample, the st 
11 points at ages 6—10. Why i 
PROBLEMS OF DESIGN AND VALIDITY


Dependence on Language


The matrix test is entirely free from verbal content, and the
Kuhlmann-Anderson uses predominantly nonverbal items. Many other popular
tests, however, are almost completely verbal. Vocabulary and verbal . . .
have always been found good predictors of school and college success;
arithmetic reasoning and number series are also popular item forms.


TABLE 24. Approximate Level of Reading Ability Required to Comprehend Items
of Group Mental Tests

                                              First Items    Last Items
Kuhlmann-Anderson (booklets for Grade 6)          5.0            7.5
Otis (Higher, Form C)                             7.0           13.0
Terman-McNemar
   a. Information                                 5.5           10.0
   b. Logical selection                           9.0           17.0
   c. Analogies                                   6.0            7.0
   d. Best answers                                9.0            9.0
Henmon-Nelson (Form A, Grades 7-12,
   nonarithmetic items)                           6.5           14.0

Note: . . . determining readability from sentence . . . A level of 7.0 is the
performance of the average seventh-grader.
Source: . . ., 1950.

re language is continually m ‘lity: 
re as a sign of lack of ment: 
To avoid such errors, most testers prefer to use a test which includes both
verbal and nonverbal items. Sometimes the two types are included in a single
score, as in the Kuhlmann-Anderson. Sometimes separate scores are obtained,
as in the Raven Matrices and the vocabulary test recommended to accompany it.

It is wrong to assume that a test which requires no reading is independent
of language abilities. The directions are almost always verbal, and not
always easy to comprehend. Sometimes the solution to a problem such as figure
analogies or matrices requires complex symbolic reasoning with abstract
concepts. The person almost certainly relies on his vocabulary to arrive at
the answer. A person whose language experience has been limited (e.g., a
deaf child or a bilingual) is likely to be handicapped on some of the
so-called nonlanguage tests.


Comparability of Scores


Each test maker develops norms for his tests using his own standardizing
group. It is particularly important to realize that an IQ of a given size has
a different meaning in different tests, or on the same test at different
ages. In a recent study over 2200 9- and 10-year-olds took four prominent
group tests. Where the Stanford-Binet distribution indicates that about 220
should have IQs 120 and over, the Kuhlmann-Anderson showed such IQs for 187
children, and the Henmon-Nelson for 524 children. At the low end of the
scale, where 220 are expected to fall below IQ 80, Kuhlmann-Anderson reports
58 and Henmon-Nelson reports 119 such cases (Eells et al., 1951).

In another study, Lennon compared three tests on equivalent samples and
determined what raw scores on the three tests were comparable (Lennon,
1952). These scores were then converted to IQs and MAs, using the tables
from the test manuals. He found, for example, that an IQ of 130 on the
Terman-McNemar is earned by the same pupils who would earn 128 on Otis
Gamma and 126 on the Pintner Verbal. An MA of 14 on the Terman corresponds
to 13-9 on Otis and 13-6 on Pintner. Obviously differences in standardizing
samples cause IQs on some tests to average higher or to spread out more than
on others. Another source of variation is the use of now-obsolete
statistical techniques intended to yield ratio IQs. As all tests turn to the
use of standard scores or percentiles within age groups, comparability of
tests will depend wholly on the adequacy of norming.


Degree of Speeding


Most tests of general ability are given with a time limit. . . . The time
allowed for an ability test may be so short that standings are determined
almost entirely by speed of work, or may be so liberal that everyone
finishes. Most tests present items in order of difficulty so that each
student encounters the items he can do, and only the best student is pinched
by the time limit. Table 25 indicates the effect on score when pupils are
given added time on three typical tests. Evidently, most pupils finish in the
standard time all the items they can do. For occasional cases, of course,
speed will still be the principal factor determining scores.


TABLE 25. Effect of Giving Pupils Additional Time on Group Intelligence Tests

                                                               Mean Points   Mean Points
                                                               Earned in     Earned in
Age of   Number of                       Standard   Extra      Standard      Additional
Pupils   Pupils     Test                 Time       Time       Time          Time
9-10     223        Otis Alpha           20 min.    30 min.    65.0          1.1
                    Non-Verbal
9-10     226        Henmon-Nelson        30 min.    20 min.    54.1          3.4
13-14    235        Otis Beta (verbal)   30 min.    15 min.    60.4          0.9

Source: Eells, 1948.


Formerly a distinction was made between "speed tests" (time-limit tests)
and "power tests" (work-limit tests). A test with a time limit, however, does
not necessarily depend on speed. To decide whether a time-limit score
depends on speed, we would need a special experiment. We would first give
the test in the usual manner, obtaining score x, and then allow enough time
for everyone to finish, obtaining the unspeeded score y. If most persons have
the same relative standing on x and on y, the added time made little
difference and the time-limit score depends on the same abilities as the
untimed score (Helmstadter and Ortmeyer, 1953; Cronbach and Warrington,
1951). In the Kuhlmann-Anderson manual an experiment is reported in which
children were allowed to complete the test after time had been called, using
a second color of pencil so that both timed and untimed scores were
available. The two sets of scores correlated as follows: in Grade 3, .74;
Grade 5, .83; Grade 7, .87; Grade 9, .93. Perhaps it appears that these
correlations are so high that the two tests measure the same thing; but
making allowance for the nonindependence of the two measures, we estimate
that in Grade 5, for example, at least 31 percent of the test variance is due
to speed.
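
The 31 percent figure is consistent with taking the variance not shared by the timed and untimed scores, 1 - r², for the Grade 5 correlation of .83. The text does not state how the estimate was reached, so the sketch below shows one reading that reproduces the number, not necessarily the calculation actually used. The function name `unshared_variance` is an illustrative choice.

```python
# Correlations between timed and untimed scores quoted in the text.
correlations = [0.74, 0.83, 0.87, 0.93]


def unshared_variance(r):
    """Proportion of variance in one score that is not shared with the other."""
    return 1 - r ** 2


for r in correlations:
    print(f"r = {r:.2f}: {unshared_variance(r):.0%} of variance not shared")
```

Because the timed score is part of the untimed score, the two are not independent measures and the observed correlation is inflated. The unshared proportion is therefore best read as a lower bound, which matches the "at least" in the text.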

We cannot tell whether this is an advantage or a disadvantage in prediction
without knowing whether the criterion task calls for speeded performance, and
what type of speed it calls for. When the criterion task does not demand speed
or demands a type of speed not involved in the test, speeding the test
introduces an irrelevant variable. For general academic criteria, a measure of
power independent of speed is more relevant than a speeded score. With a long
testing time, an unspeeded test is more valid than a speeded test covering the
same material. If only a short time is available for testing, however, a
speeded test will be more reliable than an unspeeded test containing very few
items. As a result, the short speeded test has greater predictive validity
than the even briefer test that everyone can finish in the same time (F. M.
Lord, 1953).
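
Why a test of very few items is unreliable can be illustrated with the
Spearman-Brown formula, which projects reliability as a test is lengthened
with comparable items. This is our own illustration, not Lord's derivation:
the starting reliability of .50 is hypothetical, and the formula assumes
parallel items, an assumption that does not strictly hold for speeded tests.

```python
def spearman_brown(r_unit, k):
    """Projected reliability when a test is lengthened k-fold with parallel items."""
    return k * r_unit / (1 + (k - 1) * r_unit)


# Hypothetical very short test with reliability .50, lengthened 2, 4, and 8 times.
short_test_reliability = 0.50
for k in (1, 2, 4, 8):
    projected = spearman_brown(short_test_reliability, k)
    print(f"length x{k}: projected reliability = {projected:.2f}")
```

Under these assumptions, each doubling of length raises the projected
reliability, which is the sense in which a test with more items in the same
time can be the more reliable instrument.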

The trend in recent American tests is to provide ample time for nearly 
everyone to finish. This point of view is not universally accepted. Eysenck 
(1953) and Furneaux, in England, argue that the speed with which the
mind produces hypotheses is the essence of good problem solving, and that
a speeded test is therefore the best measure of mental ability.


Stability 


Test scores are unstable when behavior patterns are being acquired, and we
would expect a pencil-and-paper test score to be unstable in the earliest
school years. The reliability of the Detroit First-Grade Intelligence Test is
.91 by a split-half method, indicating good accuracy, but is only .76 when a
retest after four months is given. Seagoe (1934) gives data on repeated
measurements of various groups after a two-year interval, as follows:


Detroit First Grade at age 6-4 with Detroit Primary at 8-8      r = .64
Detroit First Grade at age 6-8 with Haggerty Delta at 8-8       r = .66
Detroit Primary at age 8-9 with National Form B at 10-8         r = .73
National Form A at 10-4 with Terman at 12-5                     r = .80
National Form B at 10-7 with Terman at 12-6                     r = .87


Predictions of intellectual performance over short intervals of time can be
made with substantial accuracy, but the mental test permits only approximate
long-range predictions in the lower grades of school. Allen reports, for
example, that the Kuhlmann-Anderson IQ in Grade 1 predicts achievement early
in Grade 4 with a validity of .52. Once the initial development of reading and
seatwork is past, group tests for successive ages measure about the same thing
and do so with considerable stability, as Seagoe's data show. By adolescence,
scores appear to be extremely stable. H. E. Jones reports that scores at age
17 on the Terman Group Test correlate .84-.90 with retests at age 33 (J. E.
Anderson, 1956, p. 159). Despite this stability, the tester should not rely on
an old mental-test score when a critical decision is being made. Some young
people make substantial changes in mental performance over a three-year
period.


Overlap with Achievement Tests 


The Kuhlmann-Anderson, like most group tests of general ability, is closely
related to educational status. According to the test manual the test
distinguishes sharply between accelerated and retarded children in the same
grade. The concurrent correlation of the test with total score on an
educational achievement battery, with age held constant, is .84 in one study,
.77 in another (Hilden and Skeels, 1935; Allen, 1944a). This correlation is
higher than that of the Stanford-Binet with achievement.

Some experts argue that it is impossible to separate aptitude or intelligence
from achievement. If intelligence and achievement tests measure the same
things, we are only fooling ourselves by giving them different labels. We can
examine this criticism by making use of two statistical principles:
reliability is the proportion of test variance that is nonerror variance; the
validity coefficient squared indicates what proportion of test variance
measures the same attribute as the criterion. If the reliability is .86¹ for
the Kuhlmann-Anderson, and the intercorrelation of test with achievement is
.84, we arrive at Figure 37. The square of the intercorrelation indicates the
overlap in variance. Seventy-one percent of the test variance represents what
the achievement test measures; 14 percent is error. Therefore, 15 percent of
the Kuhlmann-Anderson test variance is due to some reliably measured ability
independent of achievement. The test would report some differences among
children having identical school achievement, but among children with the
same achievement about half of the individual differences in IQ are due only
to random errors. A similar conclusion would hold for other heterogeneous
group tests of mental ability (Coleman and Cureton, 1954). For most children,
a group mental test leads to the same prediction that a comprehensive
achievement test would.


FIG. 37. Overlap of the Kuhlmann-Anderson test with an achievement measure.
(Segments shown: overlap with achievement measure, 71%; unreliability, 14%.)


¹ A coefficient of equivalence computed from subtest intercorrelations for
one grade, by a modified analysis-of-variance method. The coefficient reported
in the manual is based on a split-half technique, which should not be applied
to speeded tests.
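
The arithmetic behind this partition of variance can be retraced in a few
lines. The sketch below uses only the two coefficients given in the text
(reliability .86, correlation with achievement .84); the function and variable
names are ours, chosen for illustration.

```python
def partition_variance(reliability, r_with_achievement):
    """Split test variance into overlap, error, and reliable-but-unique parts."""
    overlap = r_with_achievement ** 2   # variance shared with the achievement measure
    error = 1.0 - reliability           # variance due to random error
    unique = reliability - overlap      # reliable variance independent of achievement
    return overlap, error, unique


overlap, error, unique = partition_variance(0.86, 0.84)

# Among children with the same achievement, only the non-overlapping variance
# remains; the share of it that is error is error / (error + unique).
error_share = error / (error + unique)

print(f"overlap = {overlap:.2f}, error = {error:.2f}, unique = {unique:.2f}")
print(f"share of remaining differences due to error = {error_share:.2f}")
```

The last figure, a little under one-half, is the basis for the statement that
about half of the IQ differences among children of equal achievement reflect
random error.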
(Stewart, 1947).

occupations have greater average ability, there is an extreme range within
each occupation.

Just what will happen to a boy with a given IQ is difficult to predict. He 
may do well in school and college and enter a profession, or he may drop out 
of school and remain in an unskilled job. Figure 39 is based on unpublished 
data from a follow-up study of students who graduated from high school in 
Flint, Michigan, in 1943. Ten years later the investigator located and ob- 
tained information about the subsequent careers of 97 boys. The figure shows 
what happened to boys at each ability level. 

The boys are divided into three levels according to Kuhlmann-Anderson 
IQ in Grade 9. For each group, the figure shows high-school grades, college 
history, and occupation ten years after graduation. The chart merits detailed 
study; it indicates the uncertain predictive value of high-school test scores, 
while at the same time showing that they have a definite relation to future 
success. We shall mention only a few of the relations that can be traced in 
the figure. There is appreciable correspondence between IQ and grades; 
practically no one in the lowest IQ level earned superior marks. Boys in the 
lowest group did not enter college unless their grades were exceptional, and 
they were more likely than the other groups to be in unskilled jobs. About 
one-third of the group with IQ 90-104 entered college, and half of them 
graduated. The group with superior IQs earned better high-school marks, 
but they did no better in college than students with average ninth-grade IQs 
and similar high-school grades. Moreover, the occupational status of high 
and middle IQ groups who went to college is the same. Among those who 
did not go to college the occupational level corresponds somewhat to IQ. 
The most striking finding is that, regardless of IQ or high-school average,
every student who finished college was in an upper-level occupation ten years
after completing high school. The predictive significance of a ninth-grade IQ
would differ somewhat in other times and other places; it would be desirable
for any high-school counselor to perform his own follow-up study in order to
establish expectancies for his school.

The fact that boys with IQs below 100 can succeed in college is hard to
explain in any general way, but the individual cases often are quite
understandable. Alex, though he had an IQ of 93 in Grade 9, eventually became
a lawyer. The IQ was not inaccurate: he had 93 on a retest some months later
and 118 on the Stanford-Binet. Alex had lived in a boarding home during his
early school years following the death of his mother, and suffered from a
sense of inadequacy which led him into aggressive, offensive behavior. A
counselor felt that Alex had ability even though his tests and grades were
poor. Under the counselor’s friendly encouragement he improved his grades
to the B level and transferred to a college-preparatory curriculum. His
personal adjustment also improved. After war service Alex entered college and
completed his law course successfully (Cantoni, 1954).


FIG. 39. Educational and occupational history of 97 high-school boys. Panels
include IQ 105 and Above (30 Cases) and IQ Below 90 (18 Cases). (Data supplied
by Dr. Louis Cantoni; see Cantoni, 1955.)


The rather large number of "late bloomers" like Alex warns against making
a definite and final separation between students with high and low ability at 
the start of high school. It is hard to teach complex ideas to dull pupils, and 
their presence in the mathematics or French class will impede the teaching 
of the ablest. Many potentially able students, though, will not be recognized 
in the ninth grade. Any grouping plan must make provision for the student 
whose ability is discovered midway through high school. He must be able to 
fulfill college requirements without too much loss of time; otherwise, much 
of his talent will be wasted. 

The facts so far presented bear on differences between occupations but 
not on those within occupations. Many validation studies are available for 
specific vocations. One especially interesting result comes from a follow-up 
of workers in the home office of an insurance company. Nearly 700 workers 
hired between 1937 and 1949 were tested on a short general mental test at 
that time. New workers enter in the lower job categories and are promoted 
as their performance shows merit. The correlation between responsibility 
held in 1954 and score at time of hiring was .60. Fifty-four percent of those
in "decision-making jobs" had had scores of 120 and over; only 5 percent with
scores 0-99, and 19 percent in the 100-119 range, held these high-ranking jobs
(Knauft, 1955).

Ghiselli (1955) reviewed the entire literature on prediction of success of
workers and found that the group mental test predicts both training and
performance criteria for many jobs. The coefficients for any job title range
from very high to negligible, depending upon the range of ability in the group
tested and the demands of the specific job. Average validities for group
mental tests against job proficiency fall in the following ranges:

.00 to .19   Sales, service occupations, machining workers, packers and
             wrappers, repairmen

.20 to .34   Supervisors, clerks, assemblers

.35 to .47   Electrical workers, managerial and professional


Somewhat similar results are reported by the USES. Correlations for general
mental ability are above .40 for success of automobile mechanics, key-punch
operators, practical nurses, and bindery workers, for example. In contrast,
correlations are below .15 for electronics parts assemblers, welders,
decorators, and meat-packing workers (Guide to the Use of GATB, 1958).


6. Characterize the occupations for which general ability is a good predictor.


NOTEWORTHY GROUP TESTS 


The tests listed below are a representative sample including some of the
good current tests, tests primarily of historical importance, and tests
illustrating novel measurement techniques. The descriptions are designed to
indicate some of the ways the tests differ rather than to provide a full
review of important qualities, and the reader should obtain fuller information
from the test manual and from reviews of any test that interests him.

• American Council Psychological Examination (ACE); L. L. and T. G.
Thurstone; Educational Testing Service, 1924, frequent revisions. For college
entrants; high-school form also available. This test was formerly the
principal instrument used in testing college freshmen and in research on
college success. The total score predicts grade averages, usually with
validity about .45. There are two part scores: L (linguistic), based on
vocabulary, verbal analogies, etc.; and Q (quantitative), based on number
series, figure analogies, etc. No consistent evidence was found that the L and
Q scores predicted success in verbal and scientific subjects, respectively, as
intended (Berdie et al., 1951). Because the part scores were of little value
and the many subtests awkward to administer, the test is now being supplanted.


• Army Alpha Examination; various authors, revisions, and publishers;
currently distributed by Western Psychological Services, Psychological
Corporation, 1916, 1939, et seq. For secondary school and adult use.
Originally designed for Army group testing, the test has several speeded
subtests calling for information, reasoning, and practical judgment. Has no
advantage over more modern tests.

• California Test of Mental Maturity (CTMM); E. T. Sullivan, W. W.
Clark, E. W. Tiegs; California Test Bureau, 1936, 1957. Levels from
kindergarten to adult. One of the most widely accepted current tests, with
unusual variety of items, good format and standardization, and a continuous
series of levels. The full test requires over one and one-half hours at school
ages. There is a Short Form for use where less reliable measurement is
acceptable. Separate “Language” and “Non-Language” IQs are offered, but there
is little evidence to indicate the practical significance of differences
between the two IQs. Subscores for memory, logical reasoning, etc., attempt to
provide a profile of abilities, but these subscores have dubious validity and
should be given little attention. By standardizing CTMM along with the
California Achievement Tests, the authors provide for comparison of the
pupil's attainment scores with the expectancy for his IQ level (see p. 387).

• College Qualification Tests; George K. Bennett and others; Psychological
Corporation, 1957. College and precollege. An eighty-minute test designed for
measuring general scholastic promise of college applicants and students. In
addition to the total, the part scores measure verbal ability, numerical
reasoning, and information in three fields. This, like SCAT, is essentially a
sample of educational attainments. Norms are provided for various types of
college and also for upper years of high school. The total score predicts
freshman grade average, validity often being above .60.

• Concept Mastery Test; Lewis M. Terman; Psychological Corporation,
1939, 1956. College juniors and above. An untimed test designed to measure
the highest ranges of vocabulary and verbal reasoning.

• Cooperative School and College Ability Tests (SCAT); anon.;* ETS,
1955. Grade 4 to college. This test is offered to replace the ACE as a device
primarily for predicting academic success. A Verbal score measures vocabulary
and reading comprehension; a Quantitative score measures arithmetic reasoning
and understanding of arithmetic operations. Both measure school-learned
abilities. The tests are well prepared, but validity, stability, and
discriminating power in exceptional groups have not yet been thoroughly
investigated. They should serve well for selecting potential college-goers,
and appear highly suitable for use by educators not psychologically trained.

• Culture-Free Intelligence Tests; R. B. Cattell; IPAT, 1933, 1944, 1950.
Age 4 to adult, 3 levels. A nonverbal test including matrices and other
reasoning tasks with geometric figures. The test is independent of language
skill but is not truly free of cultural influences. Norms for the test are
unsatisfactory; IQs have a very large s.d.

• Davis-Eells Games; Allison Davis and Kenneth Eells; World Book,
1953. Grades 1-2, 3-6. Items are designed to be interesting and fair to
lower-class children (see below). Problems are presented pictorially rather
than verbally, and the test is relatively difficult to administer. Though the
test is long, the reliability is lower than that for competing tests. The test
does not predict academic performance under present teaching methods as well
as verbal tests do, but is designed to locate children for whom new teaching
approaches are needed.

• Henmon-Nelson Tests of Mental Ability; Tom A. Lamke and M. J. Nelson;
Houghton Mifflin, 1931, 1950, 1957. Grades 3-6, 6-9, 9-12, college. A
thirty-minute test of the “spiral omnibus” pattern in which various item types
are presented in rotated order with a steady rise in difficulty. Items include
information, proverb interpretation, figure analogies, following directions,
etc. Carbon-sheet method of quick scoring. The 1957 revision is well designed
as a short measure of scholastic ability having reliability over .90, but
considerable overlap with reading ability and no diagnostic features.

• Kuhlmann-Anderson Intelligence Tests; F. Kuhlmann and Rose
Anderson; Personnel Press, 1927, 1952. Age 6 to maturity. (See pp. 218 ff.)

• Lorge-Thorndike Intelligence Tests; Irving Lorge and Robert L. Thorndike;
Houghton Mifflin, 1954. Levels from kindergarten through high school. A
well-constructed test. At the primary level, questions requiring verbal
understanding and reasoning are read by the teacher, and the pupil responds
by marking pictures. In Grades 4 and above, nonverbal and verbal sections
can be separately administered. The nonverbal items call mostly for general
ability, independent of vocabulary and reading. Since the verbal and
nonverbal scores correlate about .70, differences between the scores will not
be significant for the majority of pupils.

* An entry of "anon." indicates that the test was prepared by the staff of
some organization—in this case, the Educational Testing Service. The
responsibility for test development is shared so widely that listing the many
cooperating authors would not be informative.

• Miller Analogies Test; W. S. Miller; Psychological Corporation, 1926,
1950. Superior adults. A test of 100 very difficult verbal analogies items, ad- 
ministered only at licensed centers. Severe restrictions protect the security of 
items, since it is used by many graduate schools to test applicants. Sizable 
validity coefficients for predicting success in graduate study are reported, de- 
spite the narrow range of ability within which the test is designed to discrimi- 
nate. 

• Ohio State University Psychological Examination (OSPE); H. A. Toops;
Ohio College Association, 1919, frequent revisions. SRA publishes Form 21
(1940); the Minnesota Scholastic Aptitude Test is a shortened version of
Form 23. High school, college. By restricting items to vocabulary and reading
ability, the author obtains a score which predicts college marks with unusual
accuracy (coefficients of .60 are common). The test requires about two hours.

• Otis Quick-Scoring Mental Ability Tests; A. S. Otis; World Book, 1920,
1936, 1954. Forms for Grades 1 to college. Otis was one of the first to
experiment with group measurement techniques. His tests generally combine
verbal and nonverbal reasoning items to obtain a quick measure of general
ability. IQs tend to be lower than for other tests. The technical development
and the manuals for these tests are less adequate than for tests of recent
origin, but predictive validities against school achievement compare
favorably with other tests.

• Pintner General Ability Tests; Rudolf Pintner; World Book, 1931, 1945.
Grades 4-9. There are separate language and nonlanguage tests, each requiring
about 45 minutes. There are companion tests for lower grades, namely, the
Pintner-Cunningham Primary Test and the Pintner-Durost Elementary Test. The
latter contains two subscores, one being based on verbal reasoning items read
by the pupil. The other measures vocabulary and verbal reasoning independent
of reading skill by having the pupil mark pictures in response to questions
read by the teacher. The two scores give significantly different information.
The verbal test for intermediate grades is much like other older tests. The
nonlanguage test contains six subtests, some of which are ingenious and taxing
reasoning tests. The nonlanguage score adds more unique information to data
normally available from achievement tests than does the verbal test or the
usual omnibus intelligence test.

• Progressive Matrices; J. C. Raven; H. K. Lewis (London), Psychological
Corporation, 1938, 1947, 1951. Ages 5½ to 11; age 9 upward. One of the
best available techniques for obtaining a nonverbal measure of reasoning
ability, though the single type of item places a possibly undesirable emphasis
on spatial reasoning. The norms are based on poorly selected groups.
Reliability of the scale in single age groups, especially young ones, is
inadequate. An efficient, properly standardized form is badly needed.

• Scholastic Aptitude Test (SAT); anon.; College Entrance Examination
Board, 1926 to date. This examination is administered in a controlled program
to applicants for admission to affiliated colleges. Not sold for general use.
Tests measure vocabulary, verbal reasoning, knowledge of high-school
mathematics, and quantitative reasoning, combining comprehension items with
subtle reasoning in order to discriminate at high levels of ability. Within
the group surviving to the fourth year of college, SAT-Verbal correlates ^7
with grade average in the typical college, and SAT-M (Quantitative)
correlates .27 (J. French, 1957).

• Semantic Test of Intelligence; P. J. Rulon; unpublished, 1952. Adults.
A nonreading test for testing conceptual reasoning, designed to determine
which illiterates in the Army should have literacy training. (This is part of
an effort to utilize draftees who would otherwise have to be rejected.) The
man is taught by pantomime the meaning of certain symbols, and goes
through a series of "decoding" problems, beginning with two-choice items.
In Figure 40, the upper panel shows that the first symbol stands for “cow”;
the first phrase is “cow jumping,” and the fourth picture should be circled.
Correlations show that this worksample of learning is more a measure of
ability to learn, and less a measure of attainment, than other mental tests
(Rulon and Schweiker, 1953).


FIG. 40. Nonverbal item for the Semantic Test of Intelligence. (Rulon, 1952.
Reproduced by permission of Dr. Phillip J. Rulon.)

• Terman-McNemar Test of Mental Ability; L. M. Terman and Quinn
McNemar; World Book, 1940, 1949. Grades 7-12. A well-constructed
test, reliable and easily interpreted. Restricted to verbal reasoning and
information in order to predict school marks.


7. In an adult counseling center, adults with varying educational backgrounds
and vocational goals must be given advisement. Prepare what you consider a
minimum list of intelligence tests (group and individual) needed to cope with
all nonpsychiatric cases.

8. Prepare a minimum list of intelligence tests needed by a school psychologist
who is expected to diagnose any pupil, age 6 to 16, whose behavior or school
work is considered unsatisfactory.

9. Would a group or an individual test be preferable

a. in screening applicants for teaching positions in a large city?
b. in testing juvenile delinquents prior to decisions about probation?
c. in research on trends in the Loin, ip of tun grea?
d. in hiring secretarial employees for a university?

10. What interpretation would be made if a child has Non-Language IQ 120,
Language IQ 90? What interpretation would be made if the Language IQ
were 120, Non-Language 90?

11. In one test, pupils listen to a story about "The Pack Train."
In the story, a man goes to a mining camp by pack train, passing a glacier and
being threatened by a grizzly bear. After hearing the story, pupils go on to
take other sections of the test. After an elapsed time of 25 minutes, pupils
are asked questions about the story. What does the test measure besides
general ability in "delayed recall"?


USE AND FUTURE PROSPECTS OF ABILITY TESTS 


After a generation of enthusiastic acceptance, group tests of ability have
come under attack from many quarters. One challenge comes from the analytic
measures of differentiated abilities, which hope to offer descriptions of
patterns of ability in place of the overall score of the general test. The
other principal challenge grows out of the recognition that test scores—at
least from age 8 to 20—are strongly influenced by past school achievement.


Chief Functions of Group Mental Tests


If one is comparing students who have been in the same class, the high
correlation between general ability tests and achievement batteries means that
it makes little difference which we use since they lead to similar decisions.
When one compares persons coming from different educational backgrounds, the
general ability test is often much the more suitable because it is not matched
to any particular educational experience. Among the important functions of the
general mental test are these:

• Comparing pupils at the beginning of a school year. The mental test
is fair to pupils coming from various schools, whereas an achievement battery
might not be.

• Decisions regarding the admission of college students. High-school
grades or rank in high-school class usually predict better than mental tests,
but it is hard to compare grades from different schools, especially small ones.
A combination of high-school grades with a group mental test commonly predicts
college marks with a validity of .60 to .70 (D. Harris, 1940). Achievement
batteries can be substituted for the mental test as a predictor. Several
studies of the Iowa Tests of Educational Development find validities of
.60-.70 for the test alone (Using the Iowa Tests, 1957), but this test
requires several hours whereas the usual college-level mental test takes from
45 minutes to two hours.

• Selecting employees. The fact that persons have different school
backgrounds makes an achievement battery unsuitable, especially where the job
depends little on school learning. General mental tests are often more
acceptable to adult job applicants than a test reminiscent of schoolwork
would be.

• In research, for dividing subjects into groups of equal ability so as to
compare different methods of instruction or to study effects of motivation,
etc.


The Spectrum of Ability Tests 


In the descriptions of tests labeled as measures of general mental ability,
scholastic aptitude, or intelligence, it is apparent that the name covers a
considerable variety of test content. Tests can be arranged in approximately
the pattern shown in Figure 41, along a spectrum ranging from those which
are strictly measures of outcomes of education to those which are most
independent of specific instruction. For the sake of contrast, we anchor the
spectrum at the educational end (A) with tests of subject-matter proficiency
which measure how much the pupil knows about particular courses such as
algebra and physics. A few tests (notably CQT) have measured information in
subject fields as a part of “scholastic aptitude” tests. Tests of general
educational development are next in order (B). These tests, to be described in
Chapter 13, measure general abilities and study skills which might be acquired
in many different courses, such as ability to interpret graphs and charts,
ability to comprehend and draw conclusions from scientific articles, etc. At
C, the tests measure educational proficiencies such as size of vocabulary and
arithmetic reasoning; these intellectual tools are even more fundamental to
intellectual work than those at B. At D we begin to move away from things
directly taught in school; the tests present puzzling verbal problems which
require the student to reorganize knowledge.


FIG. 41. Spectrum for comparing tests of scholastic aptitude.


The distinction between C and D tasks may be illustrated by vocabulary items.
A C item may ask the meaning of a fairly rare word—for example, requiring
choice of the nearest synonym for the given word, as in this Lorge-Thorndike
item:


subvention meeting support change criminal lessee 


A D item uses words known to most pupils in the grade tested but requires
a difficult comparison. For example, the DAT Verbal Reasoning Test (Bennett
et al., 1947) requires choice of two words to complete an analogy:


. . . . . is to static as dynamic is to . . . . .
1. radio    2. politic    3. = s -M
A. speaker    B. motor    C. regal


At E, we come to tasks which require reasoning with abstract concepts but
which require little if any familiarity with the examiner's language. As we
move toward F we attempt to emphasize concepts and experiences familiar
to every subject, while still requiring careful reasoning.
g seems to correspond closely to E on this spectrum; it is concerned
with the ability of the individual to form or detect relationships and
abstractions. Most group tests combine tasks at levels D and E, sometimes 
in separate verbal and nonverbal scores. Many recent tests, particularly for 
college-level students, move up the spectrum to B and C levels, thus becom- 
ing more directly a reflection of how well the student has done in past school- 
ing. Tests for the primary levels, on the other hand, are necessarily concen- 
trated at E and F. A few tests have succeeded in developing items of type F 
for older subjects. 

The functions of the tests at the two ends of the spectrum are different. 
Those toward the top are designed for cold-blooded prediction of future 
school success. One who has done poorly in past schooling is a bad bet for the 
future, no matter what his "intelligence" may be. Those who admit students 
to college or award college scholarships rarely take a chance on the student 
“who would succeed if he turned over a new leaf.” They prefer the test which 
deliberately handicaps the student who has had poor schooling or has taken 
little advantage of it. On the other hand, the teacher and counselor working
with a student want to know what undeveloped resources he has. They can
rely on past achievement for an estimate of probable future accomplishment
when nothing out-of-the-routine is done for the student, but the mental test
ought also to locate undeveloped potential that novel treatment may bring
out. For the latter purpose the most information is provided by tests at
levels E and F, which have a minimum of overlap with achievement. Tests in the
range from D to F are preferable when it is necessary to compare persons
coming from different educational and cultural backgrounds. The more
different the backgrounds, the farther toward F the test should be, unless the
criterion task requires some particular background.


12. Classify the Kuhlmann-Anderson subtests illustrated in Figure 36 on the
spectrum.


The Shift Toward Achievement Tests for Academic Prediction 


It is evident that tests of types D and E have much overlap with tests at
B and C, but the effort of test developers, beginning with Binet, was
directed to measuring “mental ability” as distinct from achievement. Recently,
many test developers have come to the conclusion that this is not a profitable
endeavor when one’s aim is to predict school success. These workers are now
recommending tests of types B and C for this purpose. The ACE test, for
example, was for a long time the principal instrument used by colleges for
measuring scholastic aptitude of entering freshmen. In 1955 its publisher,
acting on the advice of specialists in educational guidance, introduced in its
place the School and College Ability Tests (SCAT).

The SCAT is a measure of verbal and arithmetic comprehension; these
abilities played a large part also in the ACE, but that test included number
series, figure analogies, and other reasoning tasks not directly taught in
school. According to the manual (Cooperative Test Division, 1955, pp. 8-5):


The tests in the SCAT series have been designed and developed for 
the principal purpose of helping teachers and counselors—and students 
themselves—to estimate the capacity of each individual student [in 


high school and college] to undertake the academic work of the next
higher level of schooling. . . .
In considering the general purposes for which the SCAT series was
to be designed, and the continuity of measurement that was to be a
principal objective in development of the series, the Advisory Committee
recommended strongly that the new tests should measure "school-
learned abilities" directly, rather than psychological characteristics or
traits which afford indirect measurement of capacity for school learning.
This recommendation was based on three general observations shared
by all members of the committee: (a) that the best single predictor of
how well a student is likely to succeed in his school work next year is
“how well he is succeeding this year”; (b) that a certain few school-
learned abilities appear to be critical prerequisites to next steps in
learning throughout the range of general education, among them skills in
reading and in handling quantitative information; and (c) that school-
learned abilities usually can be discussed with students and parents in
a more objective way than can such emotionally-loaded characteristics
as “intelligence” or “mental ability.”


Demand for Ability Tests Independent of Social and Educational Background


While one group of testers proposes to abandon general ability tests as
unnecessary, many others have taken the position that the solution to the
problem of overlap is to make general ability tests less dependent on
background even if this reduces their correlation with subsequent school
success.

There has been particular dissatisfaction with the tests for high levels of
ability, most of which are measures of information at least as much as they
are measures of thinking power. It is not a simple matter to prepare
unambiguous difficult items, and most testers raise difficulty by means
of speeding or by introducing items which [...]. The Concept Mastery Test
and the Miller Analogies Test require the subject to have a very large
vocabulary. The matrix [...] high-level version of it has [...] been
published. A further criticism of present tests, whether group or individual,
is that they fail to measure creative ability (Thurstone, 1951). Efforts to
test types of thinking characterized by logic, accuracy, and knowledge have
been very
successful. There has been little success in identifying the types of thinking 
which distinguish the novelist from the engineer or the theorist from the mu- 
seum curator. This is probably of minor importance at younger ages where 
all abilities seem to be highly correlated, although even here we perhaps oc- 
casionally overlook a child who has high potential along lines not stressed in 
our tests. In vocational guidance, we have little basis for judging which man 
will be most insightful, most creative, and most original. 

A much more severe criticism is made by Allison Davis, who argues that 
use of tests dependent upon past schooling and school-related behavior 
denies many children a fair opportunity. Children who do well on mental 
tests are encouraged by teachers, and if they have trouble with school work 
a special study of their difficulties is made. If a child with a poor mental-test 
record, on the other hand, has trouble with schoolwork, the teacher is likely 
to accept this as natural and make no deeper inquiry. The child who could 
do better schoolwork than he has in the past is neglected just because the 
poor background lowers his mental-test score.

Davis (1951) and his associates believe that American society contains 
several cultural segments, of which the largest and most distinctive are the
middle class and the lower or "working" class. The former group, consisting
of professional, skilled, and white-collar workers, values education as a
means of maintaining a desirable place in society. On mental tests, the
average middle-class child does better than the average lower-class child.
Davis thinks that this social-class difference results from the way the tests
are constructed rather than from deficiencies in reasoning ability among the
lower-class children. Tests, he says, are biased against the lower-class
child (1951, p. 15):


The type of problem in present tests, which is clearly biased, may be
illustrated by the following:


A symphony is to a composer as a book is to what?
( ) paper ( ) sculptor ( ) author ( ) musician ( ) man

On this problem 81 percent of the higher socio-economic group
marked the correct response, but only 51 percent of the lower socio-
economic group did so. In an experiment designed by Professor
Haggard we made a problem similar to that just read, but we used
words and situations common to all social groups of children. This
problem was read to the pupils:


A baker goes with bread, like a carpenter goes with what?
( ) a saw ( ) a house ( ) a spoon ( ) a nail ( ) a man


On this culturally fair problem, 50 percent of each socioeconomic
group gave the correct answer.

This criticism of test content implies that some types of reasoning tests
handicap the lower-class child more than others. There are two major stud- 
ies which bear on this contention, both of them carried out by associates of 
Davis. Havighurst and Janke applied individual tests to all the 10-year-olds 
and nearly all the 16-year-olds in a midwestern town. The average IQs are 
shown in Table 26. In this table, class C consists of families of white-collar
workers and small businessmen, D, of semiskilled workers and laborers, and
E, of the lowest occupational groups and the down-and-outers.


TABLE 26. Average IQ of Middle- and Lower-Class Groups

                     10-Year-Olds                          16-Year-Olds
                    Cornell-Coxe          Porteus
Social    Stanford- Performance   Draw-   Maze         Stanford-  Wechsler-
Status  N   Binet     Battery     a-Man   (MA)      N    Binet    Bellevue
C      26    114        116        107    12.8     44     112       109
D      68    110        110        102    12.8     49     104       102
E      16     91         96         91    10.4     13      98       103

Source: Havighurst and Janke, 1944, 1945.


All the tests
except the Wechsler-Bellevue show about the same difference between 
classes. The performance tests do not give a substantially more favorable 
picture of the lower-class child than does the Stanford-Binet. Whatever cul- 
tural handicaps or hereditary handicaps there may be seem to be present in 
both verbal and performance measures. The difference between Binet and 
Wechsler results for 16-year-olds is not large enough to require explanation. 
The second study (Eells et al., 1951) involved the administration of nu- 
merous group tests to a very large number of pupils. As expected, the scores 
correlated with social status (.33 for the Kuhlmann-Anderson). This
difference was found on all types of items. Ninety-one percent of the items
for 16-year-olds, and 63 percent for 10-year-olds, were easier for the
middle-class child. Although verbal items showed a slightly greater
difference, this study also implies that the handicap of the lower class is
not primarily a function of test content.
Eells made another observation, however, which points to differences in 
test motivation as a possible source of inaccurate comparison. High-status 
pupils tended consistently to select the most plausible incorrect choices
(“distractors”) whereas the low-status pupils scattered responses widely over
all wrong choices. This seems to indicate that the lower group guesses more, 
and puts forth less effort on hard items. This problem of differential motiva- 
tion is one already mentioned in connection with study of national and racial 
differences. Tests are constructed to predict readiness for academic
schooling and therefore emphasize abstract ideas, careful self-criticism,
and willingness to work at a task which offers no visible reward. Davis’
case studies
indicate that working-class children live in a world concerned with concrete 
problems where sound thinking and errors meet with tangible rewards and 
penalties. Havighurst comments as follows on the motivational differences 
between classes (Eells et al., 1951, p. 21): 


The characteristic middle-class attitude toward education is taught 
by middle-class parents to their children. School is important for future 
success. One must do one's very best in school. Report cards are studied 
by the parents carefully, and the parents give rewards for good grades, 
warnings and penalties for poor grades. Lower-class parents, on the 
other hand, seldom push the children hard in school and do not show 
by example or by precept that they believe education is highly impor- 
tant. In fact, they usually show the opposite attitude. With the excep- 
tion of a minority who urgently desire mobility for their children, lower- 
class parents tend to place little value on high achievement in school or 
on school attendance beyond minimum age. 

When the middle-class child comes to a test, he has been taught to
do his very best on it. Life stretches ahead of him as a long series of
tests, and he must always work himself to the very limit on them. To
the average lower-class child, on the other hand, a test is just another
place to be punished, to have one's weaknesses shown up, to be
reminded that one is at the tail end of the procession. Hence this child
soon learns to accept the inevitable and to get it over with as quickly as
possible. Observation of the performance of lower-class children on
speed tests leads one to suspect that such children often work
rapidly through a test, making responses more or less at random.
Apparently they are convinced in advance that they cannot do well on
the test, and they find that by getting through the test rapidly they can
shorten the period of discomfort which it produces.


In an effort to provide a test which will not seem unduly abstract and
schoolish to lower-class children, and which will not penalize them for their
indifference to schooling, Davis and Eells developed a new group test. The
"Davis-Eells Games" require reasoning, but they deal with everyday
situations rather than abstractions (Figure 42). The items appear to be
interesting to pupils, appealing to much the same motivation as a comic strip
satisfies. Just how well the test achieves its aim is difficult to judge from
the evidence. It is much less dependent on reading than the Kuhlmann-
Anderson, which is itself less verbal than many group tests (Love and [...],
1957). One study finds that lower-class children lag just as far behind the
middle-class group on the Davis-Eells as on a conventional test, another
finds a smaller correlation with social class for the Davis-Eells than for the
conventional test, and a third finds just the opposite, the lower-class group
having a relatively greater handicap on the Davis-Eells (Coleman and
Ward, 1955; Noll, 1958; Fowler, 1957). These discrepancies may be due to 
community differences, but they make it clear that we have much to learn 
about the implications of social class for test interpretation. 


"This picture shows a woman; it shows a man with a bump on his head; and it 
shows a broken window. A boy is outside the window. Look at the picture and 
find out the thing that is true. 


No. 1. The man fell down and hit his head. 

No. 2. The ball came through the window and hit the man's head. 

No. 3. The picture does not show how the man got the bump on his head.
Nobody can tell because the picture doesn't show how the man got the
bump."
(No. 2 is scored as right)


"Each boy is trying to take three packages home. Which boy is starting to load
the packages the best way so he can take all three home?"


FIG. 42. Specimen items from the Davis-Eells Games. Questions below the figure
are read aloud to the group by the tester. (Copyright 1953, World Book
Company. Reproduced by permission.)

Placed alongside the arguments which led to the construction of SCAT,
Davis’ arguments bring out an issue hidden beneath previous testing
practices. Ever since Binet, testers have tried to do two things at once. They
wish to predict school success and therefore include measures of educational
skill in their tests. But they also ask the same tests to measure a
psychological attribute which is thought of as distinct from educational
attainment. Most present tests are a combination of predictive
measures which rest upon past achievement, of measures unrelated to either
past or future achievement, and of measures which predict future per- 
formance but do not depend on past schooling. Despite their ambiguity 
these tests serve many purposes fairly well, particularly when other de- 
pendable information is lacking, as is the case in employee selection and 
recruit classification. Other information is available in most school counsel- 
ing, however; the potential user then ought to ask what the mental test offers 
that adds to the other data. This question forces the test developer to take a 
clear position. He can develop a superior measure of educated skills, or 
he can develop a superior measure of unschooled abilities. Either of these 
can add to the understanding of the pupil in a way that no poorly defined 
composite can. 

The argument for basing educational decisions (selection, placement, 
guidance) on achievement tests expresses a conservative outlook (Cron- 
bach, 1957). One who uses such tests takes as his task to predict who will do 
well in school and society as now constituted. In his eyes, the tests are un- 
fair only if lower-class children do better in school than their test scores 
forecast. Investigations show just the contrary: when test scores are matched,
the middle-class children do somewhat better in school (Turnbull, 1951); if
anything, the tests do not give the middle-class group enough advantage.
Stroud (1942) concluded that “for purposes of prediction of success in
schools as now organized, intelligence tests appraise the ability of
unfavored groups as fairly as they appraise the ability of the average or
favored groups and . . . although the low average intelligence-quotient of
the unfavored groups may be the fault of society or of biology, it is not due
to unfairness inherent in the intelligence tests" (italics ours). The [...],
the school, and the world of business all demand working for remote goals
by means of abstract ideas. The usual group mental tests show how well a
pupil is likely to fit into that system.

Davis’ position attacks this conservative philosophy. He believes that
society should be fitted to the individual. If our schooling calls for thinking
and motivational patterns that fit only the child of middle-class parents, we
may be neglecting our responsibility to discover teaching methods that will
bring out the ability of the lower-class child. Fundamental as this argument
is, it has limited practical significance at this moment, because neither Davis
nor anyone else has suggested what educational methods should be used
with the lower class. When and if new methods for this purpose are found,
the tests that predict success may be different from those now used by
schools.

Much theoretical knowledge is required on the relation of ability patterns
to choice of teaching method. Ideally one would like to put each person
into that type of instruction where he will do best. There is a great need for
studies of the correlation of tests with success under various methods of
instruction. A test which correlates higher under one procedure than another
is needed if we wish to allocate the pupil to the method best for him.
There has been no systematic research on validity taking method of
instruction into account.


13. In 1958, a committee of testing specialists made the following recommenda- 
tions (among others) to state school officials regarding desirable testing pro- 
grams (Identification and Guidance of Able Students, 1958). Classify the rec- 
ommended tests on the spectrum and indicate what published tests seem to 
meet these specifications. 

a. For selecting college scholarship winners, if there is a state-wide competi- 
tion, the final examination should measure use and comprehension of the 
English language, quantitative reasoning, and ability to handle problems 
of comprehension in basic fields of knowledge including the sciences. 

b. In Grade 6 or 7 there should be a scholastic aptitude test as little dependent 

on academic skills as possible—i.e., a reasoning test based on material not 

directly taught in school. In addition the test should probably yield a score 
based on verbal and quantitative material. 

c. In Grade 10 or 11 there should be a test oriented primarily toward
predicting college success. Its content should probably be in the region
where aptitude and achievement merge.

14. What reasons can you give for or against the above recommendations insofar
as they concern type of test and placement?

15. Among college entrants, boys tend to surpass girls on tests of mathematical
abilities and girls surpass boys in literary interpretation. How would you take
these facts into account in planning a test to be used in a state-wide public
competition for scholarships?

16. [...] says (Donahue et al., 1949, pp. 148-149):
"The most recent emphasis in testing children of elementary school age has
been upon diagnosing the causes of difficulties in specific aspects of learning.
The development of diagnostic tests in various subject matter fields has shifted
attention from the problem of predicting over-all academic success to the
problem of determining the causes of academic difficulties. The shift has been
a fortunate one since it is doubtful whether over-all predictions of achievement
in elementary school are particularly useful except where extreme deviates are
being considered."
Do you agree with the last statement? Can the classroom teacher use
information about the pupil's MA, if it is within one year of the group average?


The Interpretation of “General Mental Ability” 

The preceding section discussed the practical implications of mental tests.
We should also consider the place of the concept of “general mental ability”
in psychological science. It will be recalled that the original measures of
individual differences, tried by Galton and Cattell, isolated narrowly
defined abilities such as speed of judgment. This work
produced completely miscellaneous instruments and results and seemed to
have no relevance to the general problem of human intelligence. Binet 
started a revolution with his hodgepodge, complex instrument, and all the 
more recent testers of general ability have followed his banner. Did he really 
hit on the essence of mental ability? Or can we hope for some radically dif- 
ferent approach which will penetrate more deeply into the problem? 

The efforts of many psychologists have been directed to attempts to under- 
stand general mental ability. From one side, they study the tests and their 
correlations and try to infer what the common ability running through the 
test depends upon. From the other side, they examine thinking processes 
and try to explain what differentiates the more successful from the less 
successful thinker. 

The large amount of research based on tests has established a picture of 
the “general ability” of the tests. We see the tests as, first of all, a sample of 
performance in solving a standardized intellectual problem. While it is not 
a true sample of everyday life, it is nearly as complex as any practical task. 
Far more than the person’s “intellect” is involved. His effort and his success 
depend on his self-concept, his feeling about the authority who gives the 
test, his ability to tolerate stress and frustration, and many other qualities. 
The test, then, gives a picture of the adjustment of the total person to a
standardized situation making intellectual demands.

The adjustment which the test calls for seems to involve the ability to
interpret a complicated stimulus situation, to test various possibilities
mentally, and to carry out a response which in some way “completes” the
situation. It is evident that such interpretations are dependent on past
experience. Even in a strange problem like the matrix the person must select
elements and bring to bear abstract concepts previously learned. At the
same time, level of development no doubt depends on innate potential. The
mental-test score reflects present proficiency, i.e., the structure of habits
and behavior processes which experience has molded out of the raw
material heredity provided.

The interpretations made of tests in practical work stem largely from a
clinical orientation. It is not surprising that clinical workers should regard
the test as a measure of the functioning of the total person rather than of
intellect alone, and consider this an advantage rather than a disadvantage.
General psychologists, on the contrary, have been asking just how man's
mind is able to interpret his world, and they have therefore tried to isolate
intellectual processes for separate examination. It is from such "pure"
research on thinking processes that we today hear the strongest suggestions
for new approaches to intellectual measurement.

The work of Piaget is representative of this trend. His work may be
regarded as a direct continuation of the line of research Binet was engaged in
before he turned to making mental tests for the Paris schools. Binet had been
trying to understand how attention, memory, and other processes operate,
and out of these experiments he drew his practical test procedures. Piaget 
has devoted his lifetime to the study of developmental changes. How, he 
asks, do perception and reasoning differ in the older and younger child? Do 
older children show different processes of thought, or merely superior speed 
and complexity? (Piaget, 1947). 

Only a brief summary of his conclusions can be offered. He reports that 
the changes are qualitative, that the older child thinks in quite a different 
way from the younger. The child must first learn to make perceptual
comparisons and to abstract from his sense impressions certain constructs or
"schemata." His first schemata are merely the identifications of objects: for
instance, the recognition of his mother as the same person no matter how
her dress, posture, and other superficial appearances change. He gradually
builds one schema upon another, thereby acquiring a repertoire of tools of
thought. Once he realizes that "an object" exists, he can think of it as
continuing to exist even when hidden; this stage is necessary before he can be
expected to find a hidden object. He later develops ideas of shape (constant
even though the retinal image changes), size, identity, order, etc. For
example, the preschool child may be able to compare the size of two blocks,
selecting the larger. There is a certain age where he can judge each pair
correctly, and yet cannot arrange a whole series in order. He focuses on one
pair at a time, and cannot think of the overall order. A schema or idea such as
"order" may first appear in a concrete form; i.e., the child can compare two
bead chains only when they are laid out side by side. Then he learns to
hold the abstract order in mind so that he can compare, for example, a
straight chain with one twisted in a "figure eight." When the idea of order
is completely abstracted, he can solve logical problems such as "Town A is
north of B, and C is south of B; what can you say about A and C?"

This type of research (see also Harlow, 1949) is beginning to isolate a
strictly intellectual aspect of the person’s reactions to the world. Solving
any problem, it is argued, calls for the possession of accurate schemata. The
schematic interpretation replaces the immediate, Gestalt impression. To
reproduce a Block Design pattern efficiently, for instance, the child must
disregard the overall pattern and divide the figure mentally into equal
squares.

The person does not perceive his world as a physical event. Rather, he
creates a picture in his mind, building up that picture by using whatever
schemata he knows and considers important. This abstract picture, being
simpler than the world, lends itself to formal, accurate reasoning. Piaget
and his associates, as well as workers in other centers, are now translating
his experimental procedures into tests for individual measurement. The bead
tests mentioned above are an example. Such tests will perhaps permit an
inventory of the individual’s equipment for thinking. It is too early to say
whether these tests will have direct practical importance; it is a good bet
that the total score on such a test will correlate highly with Binet’s
hodgepodge scale. Tests based on modern cognitive theory seem certain,
however, to advance our understanding of thought processes and of the
experiences which improve intelligence.


17. Investigators of the aged argue (J. Anderson, 1956, pp. 162, 170-172) that
mental tests for the older adult should call for maturity of judgment, so that
they would be similar to the intellectual requirements of the person's daily
life. What sort of test items would meet this demand? Could such tests be
applied to adolescents?

18. One of the subtests of the Tanaka group intelligence test requires the
subject to cross slanting lines, making X's as rapidly as he can, thus:
XXXXX////// Such a subtest is rarely used in American intelligence tests. On
what basis could the inclusion of such a test be criticized? What argument or
evidence would justify including this subtest in a general mental test?

19. Tuddenham (1948) gave Army Alpha to a representative sample of soldiers
drafted in World War II. Comparing these data with norms for white
soldiers in World War I, he found that whereas only 17 percent of the
World War I group had raw scores greater than 104, this score was the
median of the World War II sample. How can this difference be explained?


Suggested Readings 


Examiner's manual, Cooperative School and College Ability Tests. Princeton:
Educational Testing Service, 1957 (or later edition).
This is a good example of a modern manual. Study of all sections will be
profitable. It explains the test maker's decisions about what to measure, how
to report scores, etc., along with clear details of standardization research.
Hebb, D. O. The growth and decline of intelligence. The organization of
behavior. New York: Wiley, 1949. Pp. 274-303.
Clinical studies after brain surgery and studies of animals are described which
indicate that innate potential can be distinguished from comprehension
developed in a particular culture. Hebb's theory emphasizes the importance of
appropriate early experience to develop ability.
Heim, Alice W. Validating intelligence tests. The appraisal of intelligence.
London: Methuen, 1954. Pp. 96-112.
Heim describes five types of investigation which may be used to judge the
adequacy of a general mental test and shows the way in which each type of
study, considered alone, might be misleading.
Tyler, Ralph W. Can intelligence tests be used to predict educability? In
Kenneth Eells & others, Intelligence and cultural differences. Chicago:
University of Chicago Press, 1951. Pp. 39-47.
Tyler distinguishes between tests designed to predict success in present
educational treatments and tests which might be designed to select able
persons who will not succeed in present treatments but might do well under
other methods yet to be invented.



Factor Analysis: 
The Sorting of Abilities 


NEARLY all the tests considered to this point grew out of Binet’s original 
discovery that complex problems measure general adaptive ability better 
than do simple tests of reaction and discrimination. Most innovations in 
ability testing since 1920 have been concerned with narrower abilities
required in particular jobs or school subjects. Separate measures of verbal,
mechanical, numerical aptitudes, etc., were designed, and many of them
have proved valuable in guidance and personnel classification. We might
merely describe these tests and summarize data on their validities, but such
a catalog would be endless. It will be better to look first at the modern 
techniques of classifying abilities which guide the development of such 
tests. 

Factor analysis is a systematic method for examining the meaning of a test
by studying its correlations with other variables. The investigator gives a
large collection of tests to the same persons. The analysis tries to determine
how many distinct abilities are being measured reliably, to detect additional
"trace" abilities which could be measured reliably by modifying the tests,
and to reduce the confusion which results when the same ability is given
different names in different tests. Factor analysis gives information about
the nature and organization of individual characteristics and clarifies what
any given test measures. It is used in studies of interests, attitudes, and
personality as well as in studies of ability. The purpose of this chapter is to
clarify what factor analysts are doing and to show how a factorial study is
interpreted.

It is hard to gain even a partial understanding of factor analysis. The 
technique is complicated, though the basic idea is as simple as correlation 
itself. The results of investigations have often disagreed, perhaps chiefly
because some of the older work used crude techniques and insufficient data.
During those earlier days, substantial controversies developed whose echoes
still confuse current discussion. Fortunately, these issues have largely been
settled, and factorists agree on many basic facts.

Although factor analysis is mathematical, it involves considerable judg- 
ment. The investigator chooses whatever method of organizing his results 
makes the best sense to him, and the result is variation among studies. This 
is confusing, just as it confuses the beginning student of geography to find 
different maps picturing Greenland in different ways. These differences are 
of little concern to the nonspecialist; the important thing is that all maps 
agree that there is such a large island in the North Atlantic. In our discus- 
sion of factor analysis we shall concentrate on such major features of the 
landscape and omit technical details. 


THEORY OF FACTORIAL ANALYSIS 
Interpreting Sets of Correlations 


Looking at a collection of scores such as the Wechsler subtests, we
face the question, Just how many different abilities are present? The term
ability in such a question refers to a group of performances all of which
correlate highly with one another, and which as a group are distinct from
(have low correlations with) performances that do not belong to the group
(Vernon, 1956). Vocabulary tasks perhaps define such a group. They hang
together, but are they distinct from other types of items? To take a specific
example, Wechsler Vocabulary items call for recall of word meanings;
Wechsler Similarities items call for verbal comparison of concepts. Are these
the same ability? Or can we interpret one as measuring word knowledge
and the other as measuring verbal reasoning?


For a group of junior-high-school students,

    reliability of Vocabulary = .90
    reliability of Similarities = .80
    correlation of Vocabulary and Similarities = .52.


The two tests evidently overlap. Squaring the correlation of .52 tells us that
27 percent of either test can be regarded as representing a common or
overlapping "factor." The reliability indicates that 20 percent of the
Similarities variance is due to error. This leaves 53 percent of the
Similarities variance; this nonoverlapping remainder must be due to some
distinct ability not common to Vocabulary. Likewise, 63 percent of Vocabulary
is due to an ability not involved in Similarities. There is a common factor of
verbal facility or reasoning, but each test also involves something
independent. Hence the two tests do involve distinct abilities.
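
The arithmetic of this two-test example can be sketched in a few lines of Python. This is only an illustration of the reasoning in the paragraph above; the function name `variance_shares` is hypothetical, and the Vocabulary reliability of .90 is an assumption taken from the 63 percent figure in the text (27 + 63 leaves 10 percent for error).

```python
def variance_shares(reliability, r_common):
    """Split a test's variance into common, distinct, and error shares."""
    common = r_common ** 2            # overlap with the other test
    error = 1.0 - reliability         # variance due to unreliability
    distinct = reliability - common   # reliable variance not shared
    return {"common": common, "distinct": distinct, "error": error}

# Similarities reliability and the correlation come from the text;
# the Vocabulary reliability of .90 is an assumed value (see above).
similarities = variance_shares(reliability=0.80, r_common=0.52)
vocabulary = variance_shares(reliability=0.90, r_common=0.52)

for name, shares in [("Similarities", similarities), ("Vocabulary", vocabulary)]:
    print(name, {k: round(100 * v) for k, v in shares.items()})
```

Under these assumptions the three shares of each test add to 100 percent, which is the check on the "simple but inefficient method" used here.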


Factor analysis works along these general lines, starting from correlations.
Correlation indicates whether tests possess a common element. Binet applied
such reasoning when he decided that his tests, all having a substantial
relation to each other, must be influenced by the same common factor, general
intelligence. Wissler, whose tests had very small intercorrelations, concluded
correctly that his tests had very little in common and therefore
represented different abilities or, as we would now call them, factors.


TABLE 27. Intercorrelations of Three Tests for Navy Recruits

          A      B      C
A                .81    .69
B                       .69
C

To simplify tables, each correlation is presented only once. The correlation
of A with B (or B with A) is .81. Symmetrical entries could be made below the
diagonal if desired.
Source: Conrad, 1946.

The factor concept can be illustrated by means of a series of correlation
tables. Table 27 gives correlations of three Navy classification tests with
each other. These data suggest two conclusions:

    Because the correlations are generally positive, the tests must be
    affected by some common characteristic.

    Tests A and B have more in common than either has in common with
    test C.

The reasonableness of such a result is clear when we find that A is the
General Classification test, B the Reading test, and C the Arithmetic
Reasoning test. Probably the common element in all three tests is a composite
of general reasoning ability and past learning. Two verbal tests may well have
more in common than either has in common with a mathematical test.


TABLE 28. Intercorrelations of Four Measures for Adult Workers

                          Arithmetic
                          Reasoning     Turning     Assembly
Vocabulary                   .66          .06         .14
Arithmetic Reasoning                      .03         .16
Turning                                               .38
Assembly

Source: Guide to the Use of GATB, 1958, III, G-1.1.


Table 28 has a very different pattern of correlations which shows clearly
the presence of two distinct abilities. A verbal-educational ability is found
in the Vocabulary and Arithmetic tests. Some psychomotor ability affects
both the Turning test (placing pegs in holes) and the Assembly test
(assembling a rivet and washer).
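
The inspection of Table 28 can be mimicked numerically. The sketch below is an illustration, not the procedure used in the source: it compares correlations within the two suspected clusters with those between them, and it also counts eigenvalues of the correlation matrix greater than 1, a rough rule of thumb that is an assumption added here. The Turning-Assembly entry is read as .38.

```python
import numpy as np

tests = ["Vocabulary", "Arithmetic Reasoning", "Turning", "Assembly"]
r = np.array([
    [1.00, 0.66, 0.06, 0.14],
    [0.66, 1.00, 0.03, 0.16],
    [0.06, 0.03, 1.00, 0.38],
    [0.14, 0.16, 0.38, 1.00],
])

# Correlations inside each suspected cluster versus across clusters.
within = [r[0, 1], r[2, 3]]
between = [r[0, 2], r[0, 3], r[1, 2], r[1, 3]]
clusters_separate = max(between) < min(within)

# Rule of thumb (an assumption here): count eigenvalues above 1.
eigenvalues = np.linalg.eigvalsh(r)[::-1]   # largest first
n_large = int((eigenvalues > 1.0).sum())

print("within:", within, "between:", between, clusters_separate)
print("eigenvalues:", np.round(eigenvalues, 2), "count above 1:", n_large)
```

Both checks point the same way as the text: two groups of tests, with little in common between the groups.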

A formal factor analysis goes beyond inspection and calculates how much 
each test is influenced by the various factors, as we did by a simple but in- 
efficient method for the two Wechsler subtests. (For procedures see Thur- 
stone, 1947.) 


1. Table 29 presents correlations between six tests of the Navy classification
battery. Does there appear to be a single common factor among all these
tests? If so, what might be its psychological nature?


TABLE 29. Intercorrelations of Six Navy Classification Tests 


General Electrical Mechanical 
Classification Arithmetic Mechanical Knowl- Knowl- 

Reading Reasoning Aptitude edge edge 
General Classification 81 " 60 53 E. 
Reading 69 06 E A 
Arithmetic Reasoning 61 47 "55 
Mechanical Aptitude 53 p 
Electrical Knowledge 7 


Mechanical Knowledge 


Source: Conrad, 1946.


2. Which pairs of tests in Table 29 seem to have the greatest overlap? 


The Three Types of Factors 


Three types of factors are commonly distinguished: general, group, and
specific. A specific factor is present in one test but not in any of the
others under study. A group factor is present in more than one test. A general
factor is a factor found in all the tests. If all the correlations among a set
of tests are positive, one can find a general factor. If there are any zero or
negative correlations, a general factor will ordinarily not be found (Figure
43). The mathematical methods of the factor analyst determine the correlation
between each test and each factor. These correlations provide a table
of “factor loadings.” The square of the factor loading tells how much the
factor contributes to the variance of the test (cf. Table 30).

Many factorial studies must be completed to arrive at psychological
theory (Ahmavaara, 1957). Just what factors appear and what form they
take depend on what tests are correlated. If we analyze only numerical tests,
there will be a numerical general factor. Put two or three numerical tests
into a mixed collection, and the same ability shows as a group factor. Put
just one numerical test in the battery, and the factor will be specific.
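
The link between loadings and correlations can be made concrete. If the common factors are uncorrelated, the correlation between two tests equals the sum of the products of their loadings on each common factor. The sketch below uses hypothetical loadings (they are not from any table in this chapter) to show that a general factor gives all-positive correlations, while a group factor that misses one test leaves zero correlations.

```python
import numpy as np

def implied_correlations(loadings):
    """Correlations among tests implied by uncorrelated common factors."""
    loadings = np.asarray(loadings, dtype=float)
    r = loadings @ loadings.T      # r_ij = sum over factors of a_ik * a_jk
    np.fill_diagonal(r, 1.0)       # each test correlates 1.0 with itself
    return r

# Hypothetical loadings. Columns: general factor, group factor.
general_and_specific = [[0.7, 0.0], [0.6, 0.0], [0.5, 0.0]]
group_and_specific = [[0.0, 0.7], [0.0, 0.6], [0.0, 0.0]]

r_general = implied_correlations(general_and_specific)
r_group = implied_correlations(group_and_specific)

print(np.round(r_general, 2))
print(np.round(r_group, 2))
```

Specific factors do not enter the products because, by definition, each appears in only one test.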
[Figure 43: four sample tables of possible correlations among three variables,
each with its corresponding factor pattern (general factor heavily shaded;
group factor lightly shaded).]
Panels: All factors specific; General and specific factors; Group and
specific factors; General, group, and specific factors.
It is also possible to have general and group, but no specific factors;
group factors alone; or a general factor alone. These are unlikely.

FIG. 43. Possible factorial relations among tests.


3. The Wechsler test can be scored to emphasize information about a general
factor, about group factors, or about specific factors present in various
subtests. Demonstrate the truth of this statement.

4. Confidence may be manifested in a variety of situations: making a speech to a
woman's club, taking one’s car apart to repair it, piloting a jet plane, or going
to a show instead of cramming for a test. Give three alternative explanations
of the nature of confidence: one in which it is considered as a general factor,
one in which it is divided into group factors, and one in which it is considered
as a number of highly specific factors. Which theory do you think is most
adequate?

5. Confidence is to be considered in selecting future fighter pilots. How would a
psychologist test confidence for this purpose if he believed it to be a broad
general trait? How would he proceed if he considered confidence to be specific
to a particular situation?


How Factor Analysis Groups Tests. We shall now examine several illustrative
results. Our first example treats the Navy classification tests whose
correlations were presented in Table 29. These tests had various part scores.
Peterson was asked to determine how many different abilities were being
measured, so that testers could report to classification officers all the
scores giving distinct information without reporting the same ability under
different names.

To answer this question, Peterson chose to break up the general factor
among the tests and subtests, and rotated to obtain the “simple structure”
shown in Table 31. There are three group factors, which may be interpreted




TABLE 30. Approximate Factor Loadings and Factor Composition of the Navy
Mechanical Comprehension Test

                                  Factor       Percentage of
Factor                            Loading      Variance
Verbal-educational                 .35           12
Mechanical experience              .64           41
Quantitative reasoning             .11            1
  Total common factors                           54
    ("communality")
Error, if r11 = .79                              21
Unique                             .50           25
                                                100

These values are presented solely to illustrate the form of a factorial
result. Though they are derived from D. Peterson's (1943) findings (see text)
they do not give an adequate analysis of the mechanical comprehension test.
For more complete results, see Table 31.
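
The factor composition in Table 30 follows from squaring the loadings. The sketch below reproduces that arithmetic as an illustration. The Quantitative reasoning loading is read as .11 from a poorly legible entry and should be treated as an assumption; the variable names are hypothetical.

```python
loadings = {
    "Verbal-educational": 0.35,
    "Mechanical experience": 0.64,
    "Quantitative reasoning": 0.11,   # assumed value; the printed entry is unclear
}
reliability = 0.79

# Squared loading = share of test variance contributed by that factor.
common = {name: a ** 2 for name, a in loadings.items()}
communality = sum(common.values())
error = 1.0 - reliability
unique = reliability - communality    # reliable variance not shared with other tests

for name, share in common.items():
    print(f"{name}: {100 * share:.0f}%")
print(f"communality: {100 * communality:.0f}%")
print(f"error: {100 * error:.0f}%")
print(f"unique: {100 * unique:.0f}% (loading {unique ** 0.5:.2f})")
```

The unique share is what remains after the common factors and error are removed, and its square root is the loading on the unique factor.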


TABLE 31. Factor Analysis of Navy Classification Test Scores

Test                    Subdivision                   I    II    III    Specific


Reading 70 0 0 b 
General Classification Opposites 76 0 0 ^ 
(GCT) Analogies 73 o 0 : 
Series Completion .68 o i 
Arithmetic Reasoning Arithmetic Reasoning 56 o à i 
(AR) 
Mechanical Knowledge Tool Relations te) 69 D : 
(MK) Mechanical Information x 59 9 x 
Electrical Comprehension x 67 o x 
Mechanical Comprehension x 64 9 4 
Mechanical Aptitude Block Counting 0 0 61 y 
(MAT) Mechanical Comprehension 0 x 32 65 
Surface Development x x * 


x indicates factor loading between .20 and .50; 0 represents negligible
loading, below .20. In this analysis there are small correlations between
factors which the discussion in the text ignores.
Source: D. Peterson, 1943.


by examining the tests where they appear. Factor I in this study can be
given the name Verbal-Educational (frequently designated v:ed in British
reports). Factor II can be called Mechanical Experience. Factor III cannot
be named without more facts about the tests than we have given, but it
seems to involve quantitative reasoning. Note that two tests named
“Mechanical Comprehension” measure different factors.

Peterson found only three common factors in the twelve tests. In addition,
each test measures some specific ability. Specific-factor loadings are
small except in Block Counting and Surface Development. Thus the analysis
suggests that nearly all the information in the twelve scores can be reported
in five scores: Factor I, Factor II, Factor III, and two specific factors. To
simplify the record of the recruit, GCT, Reading, and AR could perhaps be
combined into a verbal score. MK could be kept separate as a measure of
mechanical experience. It would presumably be valuable to extend MAT to
obtain better measures of the three factors it contains. By not scoring
subtests of GCT and MK, and by pooling similar tests, the tester would
eliminate seven scores from the record a classification interviewer has to
interpret. This condensation, however, would be too drastic.

Factor analysts concentrate on large factor loadings and often ignore
loadings below .50. Looking only at the larger factors would imply that GCT,
Reading, and AR duplicate one another, and that since each test is reliable
the Navy could drop two of them. Many factor analysts would have made
precisely this drastic recommendation. But Peterson did not, and he was
correct.

Specific factors and group factors with loadings below .50 may be of
considerable importance to validity. Table 32 shows how the three tests predict
grades at Great Lakes Naval Training Center. In those training courses
which require arithmetic ability, AR tends to be a better predictor than
the other tests. It would have been a mistake to drop the AR test or to pool
it with the other tests. Either Factor III or the specific factor in AR is
making an important contribution to validity, even though the loadings are
below .50. In general, while factor loadings suggest how tests may be
grouped, final decisions on design of testing programs should rest on
information about the validity of all factors including the specific ones.


TABLE 32. Validity of Tests Loaded on the v:ed Factor
for Predicting Service-School Grades

inl ithmetic P
p fin GCT Reading
31 30
Basic engineering 2 55 A2
Electrician's mate 334 .25 -34
Fire control 753 .37 36
Quartermaster 33 54 40
Cooks and bakers 13 16 26
Storekeeper t

Source: Frederiksen and Satter, 1953.


In interpreting this example we recall that a single factor analysis
identifies the factors in a set of tests only as a set. If one of the same
tests is included in a different collection, it may show a somewhat different
factor composition. If numerical tests were added to the battery, for
example, Factor III might divide into a numerical and a geometric factor. We
shall see later that a test performs rather differently when analyzed in
other batteries.




How Factor Analysis Interprets Tests. The most extensive factorial
investigation of tests has been done in Air Force research. Hundreds of tests
of all sorts were used to select pilots, navigators, and bombardiers.
Analyzing the correlations among many tests on large samples led to factorial
interpretations such as those graphed in Figure 44.


[Figure 44: three composition diagrams showing the percentage of variance in
each test attributable to each factor, to unique factors, and to
unreliability.]
Numerical Operations, a nearly pure test. Subtraction and division exercises.
Directional Plotting, a mixed test. Reporting direction between two points
when given their coördinates on a chart.
Complex Coördination, a highly complex test. Job replica apparatus test.

FIG. 44. Tests of different factorial purity. Factor loadings were determined
by analysis of a complete battery of AAF classification tests (Guilford, 1947,
pp. 828-831). The most prominent factor in each test is shaded.


The three tests are seen to be quite different in structure. Numerical
Operations contains only one important factor, whereas Directional Plotting
is influenced by four or five factors. Though Directional Plotting may predict
some criterion that demands just this mixture of abilities, it is difficult to
interpret psychologically. Complex Coördination is a test in which the person
handles a stick and rudder in response to light signals. Its largest factor is
one not found in other tests at all. The physical and psychological
complexity of the test is reflected in the variety of factors it involves.
This complexity evidently matches in some way the complexity of the pilot's
job, for Complex Coördination is one of the best predictors of pilot success.
(See also Figure 2 and pp. 303 ff.)

6. Compute the percentage contribution of each factor to the Reading test 
(Table 32), and make a composition diagram for it. Assume a reliability of .85. 
7. Make a composition diagram for Block Counting. 


“Simple Structure” 


One important step in any factor analysis is that called “rotation.”
Rotation is a procedure for placing the factors so that the results will be
most meaningful. Peterson, for example, decided to eliminate the general
factor which his correlations indicated, breaking up the analysis to emphasize
more diagnostic group factors.

A correlation table describes similarities between one test and all other
tests. The factor analyst introduces artificial variables or “factors” which can
be readily interpreted, and describes the test by its relations to these factors.
The process is like that of describing the location of a home. Jones lives next
to Smith and Adams, half a block from Brown and White, three blocks from
James, Thomas, and Schultz. This description (which resembles a row in
the correlation table) is useless if the person seeking Jones does not know
where these others live, and inconvenient when he does know. So we
introduce a reference system. We locate Jones as north of Main Street and
west of State. Or we say he lives on this side of the highway, across the
railroad tracks, and beyond the ice plant. We can place any home in relation
to these reference lines. All the alternative descriptions are correct, differing
only in completeness and communication value.
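
The idea that alternative reference systems are equally correct can be shown with a small numerical sketch. The loadings below are hypothetical, and the 40-degree angle is chosen by eye for the illustration; no particular rotation method from the literature is implied. An orthogonal rotation changes the individual loadings but leaves each test's communality and the correlations implied by the loadings unchanged.

```python
import numpy as np

# Hypothetical unrotated loadings for four tests on two factors.
unrotated = np.array([
    [0.70,  0.40],
    [0.65,  0.45],
    [0.60, -0.50],
    [0.55, -0.45],
])

def rotate(loadings, degrees):
    """Apply an orthogonal rotation of the two reference axes."""
    t = np.radians(degrees)
    rotation = np.array([[np.cos(t), -np.sin(t)],
                         [np.sin(t),  np.cos(t)]])
    return loadings @ rotation

rotated = rotate(unrotated, 40)

communality_before = (unrotated ** 2).sum(axis=1)
communality_after = (rotated ** 2).sum(axis=1)
same_communalities = np.allclose(communality_before, communality_after)
same_fit = np.allclose(unrotated @ unrotated.T, rotated @ rotated.T)

print(np.round(rotated, 2))
print(same_communalities, same_fit)
```

In this example the rotated pattern has each test loading mainly on one factor (the sign of a factor is arbitrary), which is the kind of description the analyst usually finds easier to interpret.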

The principle of “simple structure” was suggested by L. L. Thurstone, the
great American pioneer of factorial methods. His scientific aim was to
describe complex performances as composites of simpler performances, i.e.,
to break test scores into more fundamental elements. For example, SB
Memory for Sentences might be described as depending on verbal ability
(three-tenths) and on memory ability (seven-tenths). Thurstone planned
his factor analysis to find group factors having small loadings in some tasks
and large loadings in others. A “simple structure” is one in which a large
number of factor loadings are near zero, so that each test is described in
terms of just a few factors. Thurstone aimed first to track down group
factors which would have zero loadings in some tests. Second, he aimed to
discover or design tests each of which would have a high loading
on just one factor. Numerical Operations is one such test: it measures a
skill demanded by many tests and criteria, but it is almost entirely
independent of verbal, reasoning, and other nonnumerical abilities.

British investigators have been less interested in pure measures of simple
abilities. They, instead, rotate so as to identify broad factors present in a
large number of tasks, v:ed being one example.


THURSTONE'S "PRIMARY MENTAL ABILITIES" 


The effort to isolate simple abilities is perhaps breaking down today, as we
begin to suspect that no matter how far we advance the number of specific
abilities still to be isolated always stretches far beyond the horizon. But
the rather simple system of factors which Thurstone proposed in 1938 has had
great influence on all subsequent classification of abilities.


Description of the Factors 


Thurstone (1938) gave 56 tests to students at the University of Chicago
and found six predominant factors: Verbal (V), Number (N), Spatial (S),
Word fluency (W), Memory (M), and Reasoning (R). Subsequent studies
also have found these factors to be useful reference axes, though reasoning
in particular is treated differently in recent work. Thurstone published a
selected set of relatively pure tests to measure these "primary mental
abilities." Items from the "PMA tests" are shown in Figure 45. To understand
the Thurstone factors we can examine these items and also SB items known to
have loadings on these factors.

The verbal factor V is found in vocabulary tests, and in tests of
comprehension and reasoning. There are verbal loadings in SB vocabulary,
comprehension, verbal absurdities, and other tests.

The number factor N appears in simple arithmetic tests. Tests of arithmetic
skill are purer measures of N than are tests of arithmetic reasoning.
Giving the number of fingers on one hand, and repeating digits backward
are among the Binet items with loadings on N.

The spatial factor S deals with visual form relationships. Spatial loadings
appear in picture absurdities, copying a diamond, drawing a design from
memory, and paper cutting.

The memorizing factor M appears in tests which call for rapid rote learning,
including memory for words, digits, and designs.

Reasoning, R, appears in tests requiring induction of a rule from several
instances. Reasoning factors appear in the SB plan-of-search test, in
ingenuity (water-jar problems) and similarities between concepts.

The word-fluency factor W (which is clearly distinct from V) calls for
V   Today much of our clothing is designed to make a fashionable
    appearance rather than for ______.
        style   protection   children   sale   dresses
    Synonyms: quiet
        blue   still   tense   watery

N   Is this addition right or wrong?   42 + 61 + 83 = 176
    Mark every number that is exactly three more than the number just
    before it:
        4 11 14 10 9 12 16 8 10 8

S   Put a mark under every figure which is like the first figure in the row.

M   Study associations such as “chair-21” and "box-44." Mark the correct
    number on a later test.

R   Letter series (Which letter comes next?)
        a b x c d x e f x g h x . . .
    Letter groupings (Which group is different?)
        AAAB   AAAM   AAAR   AATV

W   List as many four-letter words beginning with C as you can.

FIG. 45. Items from the Chicago Tests of Primary Mental Abilities. (Copyright
1941 by L. L. and Thelma Gwinn Thurstone. Reproduced by permission of Mrs.
Thurstone and Science Research Associates.)


ability to think of words rapidly, as in anagrams and rhyming. It is not found
in the Stanford-Binet. The distinction between V and W is shown in two
synonym tests tried by Thurstone. A test requiring the subject to select the
correct synonym from several choices was saturated with V but not W; a test
in which the subject rapidly supplies three synonyms for an easy word
measured W, not V.

Thurstone’s list of primary abilities is a convenient reference system. The
word “primary,” however, suggests that the list is more than a matter of
convenience, that it represents something fundamental about the way the
mind works. This implication raises questions which it has taken twenty
years of research to answer:

    Is general ability nothing but a mixture or average of the primary
    abilities?
    Are these factors the only ones into which these tests could be divided?
    Are these factors unitary and indivisible?
    Is this a complete list of mental abilities?
    Is this factor structure a reflection of innate human nature or of
    cultural influence?

To these might be added questions regarding predictive validity of the
factors, but information on that subject will be accumulated through several
succeeding chapters.



8. Do Thurstone's tests cover the same ground as the Kuhlmann-Anderson test?
Can you find Kuhlmann-Anderson items which appear to represent each
Thurstone factor?

9. Which Thurstone factors are most consistent with Binet's description of
intelligence?

10. What factors would you expect to influence Wechsler Vocabulary scores?
Digit Symbol?


The Status of General Ability 


Thurstone intended by the name “primary abilities" to suggest that these
abilities combine to produce aptitude for any complex intellectual
performance, just as green, red, and blue spotlights can be mingled to produce
any other hue, or white. If this is true, general mental ability is nothing
but a mixture of primaries in some proportion. In sharp opposition is the view
of Galton and Spearman that some persons are endowed with superior general
adaptive ability which might be turned in various directions. This conflict
of views was sharpened when Thurstone found no general intercorrelation
among his Chicago tests. Since he found near-zero correlations between
ability tests, he argued that no general factor exists.

Subsequent research has altered his argument. The low correlations
proved to be due to the very restricted range of the University of Chicago
sample. In less select groups, even Thurstone and his associates found
general intercorrelations. As Burt (1958, p. 5) says:


In nearly every factorial study of cognitive ability, the general factor
commonly accounts for quite 50% of the variance (rather more in the
case of the young child, rather less with older age groups) while each of
the minor factors accounts for only 10% or less. . . . For all practical
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purposes, almost every psychologist—even former opponents of the 
concept of general intelligence, like Thorndike, Brown, Thomson, and 
Thurstone—seems in the end to have come round to much the same 
conclusion, even though, for theoretical purposes, each tends to reword
it in a modified terminology of his own.
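
The effect of a restricted range on correlations, which explains the low Chicago correlations mentioned above, can be sketched with a standard formula for direct selection on one of two variables. This is a simplified illustration under stated assumptions (linear regression, equal residual variance, selection on a single variable); the Chicago students were selected on admission criteria rather than on any one test, and the values .50, .7, and .4 below are hypothetical.

```python
import math

def restricted_correlation(r_full, sd_ratio):
    """Correlation expected after direct selection on one of the two variables.

    r_full   : correlation in the unselected group
    sd_ratio : SD of the selection variable in the selected group divided by
               its SD in the unselected group (1.0 means no restriction)
    """
    u = sd_ratio
    return r_full * u / math.sqrt(1.0 - r_full ** 2 + (r_full * u) ** 2)

# Hypothetical case: a correlation of .50 in the general population,
# observed in groups of decreasing variability.
for u in (1.0, 0.7, 0.4):
    print(u, round(restricted_correlation(0.50, u), 2))
```

The narrower the selected group, the smaller the correlation that remains, so near-zero correlations in a highly select sample do not by themselves show that a general factor is absent.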


The issue then reduces to how to take the general factor into account.
Holzinger in America and Burt in England preferred to pay attention to
the general factor first, and then to see what further information group
factors add. Thurstone preferred to concentrate on group factors, and to
account for the overall relation by identifying a “second-order” factor which
unites the groups.


The Determinacy of Factors 


Thurstone's list has often been regarded as a list of the basic elements of
the human mind. Some persons have compared it to the chemist's list of
elements. Others, critical of the approach, have condemned it as a new
"faculty" psychology.

Factor analysis is in no sense comparable to the chemist's search for ele- 
ments. There is only one answer to the question: What elements make up 
table salt? In factor analysis there are many answers, all equally true but 
not equally satisfactory (Guttman, 1955). The factor analyst may be com- 
pared to the photographer trying to picture a building as revealingly as 
possible. Wherever he sets his camera, he will lose some information, but by 
a skillful choice he will be able to show a large number of important features 
of the building. 

The fact that many other investigators find similar factors has made it
seem as if Thurstone's list did embody some fundamental truth. Yet his list
does not include anything like the v:ed factor of the British investigators,
and his N factor is defined by simple arithmetic skill rather than by
reasoning. Location of reference factors is a matter of judgment.

Thurstone’s choice of his particular factors was dictated by a criterion of
simplicity. He wanted irreducible factors and therefore matched his factors
to very simple tests wherever he could. A test whose items seemed, on
inspection, to involve many types of mental process would not satisfy him as a
measure of a pure factor. This explains, for example, why N was defined in
terms of elementary, overlearned computational skills.

The meanings of factors shift from time to time as new evidence and new
criteria are introduced. As we shall see, N, R, and S have somewhat different
meanings in current studies from the meanings they had in the 1938 list.


Divisibility of Factors


Thurstone and his students discarded the view that factors are irreducible.
While verbal tests have enough in common to define a "verbal factor," they
can be divided into several subgroups, thus establishing narrower factors
within the verbal domain. One can divide a vocabulary test, for instance, into
subgroups of words from different content areas. Within the subfactor of
"science vocabulary," we would find that some students know more chemical
terms than psychological ones. This could be pursued down to ridiculously
fine detail. Other factors subdivide similarly.

Factor analysts now recognize that abilities are most clearly described
by a hierarchy ranging from the very broad factors to those present only in
very specific tests. One can plan his statistical analysis to find only the
high-level factors, to find only factors of intermediate breadth, or to isolate
dozens of detailed factors. Many investigators have suggested possible
hierarchical arrangements, but all the proposals are tentative at present,
subject to verification by further data. Vernon's diagram (Figure 46) is one
such suggestion. (For an actual factor analysis deriving hierarchical factors
at three levels, see Moursy, 1952, and Laugier, 1955, pp. 187-208.)


[Figure 46: tree diagram with three levels.]
General:              General (g)
Major Group Factors:  Verbal-educational (v:ed); Practical (k:m)
Minor Group Factors:  Verbal (v); Number (n); Mechanical Information;
                      Spatial (k); Manual

FIG. 46. Sketch of a possible hierarchy of abilities. (After Vernon, 1950,
pp. 22-23.)


Completeness of the List 


The preceding remarks have already indicated that the number of possible
factors is inexhaustible, if we are willing to make the factors sufficiently
trivial. The question remains, however, whether significant factors can be
discovered beyond Thurstone's list. The answer is emphatically "yes." These
remarks were written at the end of World War II (F. B. Davis, 1947, p. 59):


The results of testing hundreds of thousands of men in the armed
forces and of analyzing these data suggest to many psychologists that
the number of basic mental abilities may often have been underestimated.
From factorial analyses of many different matrices of intercorrelations
obtained as a result of testing aviation cadets in AAF classification
centers, factors that have been mathematically determined have
been named as indicated in the following list.

    Carefulness                   Perceptual speed
    General reasoning I           Pilot interest
    Integration I                 Planning
    Integration II                Psychomotor coordination
    Integration III               Psychomotor precision
    Judgment                      Psychomotor speed
    Kinesthetic motor             Reasoning II
    Length estimation             Reasoning III
    Mathematical background       Social science background
    Mathematical reasoning        Spatial Relations I
    Mechanical experience         Spatial Relations II
    Memory I                      Spatial Relations III
    Memory II                     Verbal
    Memory III                    Visualization
    Numerical

There is no objective method of determining whether the names attached
to the factors discovered in the analyses are accurate descriptions
of the mental abilities represented by the factors. In any case, . . .
the number of basic mental abilities may be much larger than was
formerly believed.


Some of these added factors came from the extension of factorial investi- 
gations to psychomotor tests. Some came from bringing new pencil-and-paper 
tests into the analysis. Some came as a result of subdivisions—but not trivial 
subdivisions—of the Thurstone factors. 

One gets out of a factor analysis only what he puts in. This remark has be- 
come trite, but it is of basic importance. Factor analysis sorts the abilities 
present in the test battery; it does not unearth new ones. Thurstone identified 
the common elements in tests such as psychologists had been generally 
using. If psychologists had not yet designed tests covering some important 
ability, that ability could not show in Thurstone’s list. The Air Force invented 
and tried out many additional possibilities but by no means covered the
range of possible ability tests.


Origin of Factors 


Many of those who believe that factor analysis is identifying “the way
human abilities are organized” think that biological nature determines what
factors are found. It is conceivable that perceptual speed and spatial
judgment rely on different neural processes, and that a person could be
superior on one process and not the other. It is equally possible to argue that
correlations between abilities are produced by experience. Numerical
performances develop together, presumably because they are taught together.
A child who makes a bad start in arithmetic because of poor teaching will
lag in all numerical tasks even though he may do well in verbal work. There
is no need to conclude that the separation of numerical from verbal facility
is inborn. On the other hand, we may yet find intellectual patterns of
undeniable hereditary origin. Many sensory differences (e.g., color vision) are
of this character.


11. One student says, "It seems to me the factor analysts are like astronomers
trying to discover planets. The astronomer finds a new planet by detecting the
pull it exerts on already known bodies. Then he makes more careful studies to
check his conclusion and locate the planet exactly. The factor analyst locates
one test against already established abilities." How satisfactory is this
comparison?

12. Another student suggests that factors are comparable to constellations of
stars, which the astronomer uses to label portions of the sky (e.g., "The nebula
is in Orion"). How apt is this comparison?


THE PRESENT STATUS OF FACTOR ANALYSIS 


From some points of view factor analysis has been a great success. It
provides precise methods for handling large numbers of variables and for
reducing them to a much smaller number of scores with little loss of
information. Thus factor analysis is a highly important statistical method.
Secondly, factor analysis has cut through a large amount of nonsensical
interpretation which results from assuming that every test with a different
name measures a different ability. Thirdly, factor analysis helps to describe
what a test measures. It is gradually establishing a reference system that all
psychologists can use to describe tests.

Some critics of factor studies were disappointed when they found that not
all factors measured practically significant mental abilities. Even one of the
pioneers in the field (Kelley, 1939) spoke of the discoveries as “mental
factors of no importance.” Probably the correct position to take is that factor
studies clarify what present tests measure. They cannot identify abilities not
built into the original tests. They cannot guarantee to produce measures of
practical importance. But by clarifying the content of tests they permit the
psychologist to decide whether he is satisfied with them and help him to
throw out the components that are useless. Furthermore, the sorting of
abilities directs research to the question: For what is each of these distinct
talents useful, and how can we capitalize on it?
The great goal of the factor analyst has been to discover a dependable list
of the important abilities. Many abilities have been defined, and tests which
give fairly pure measures for most of these factors are available. R. B. Cattell
(Laugier, 1955, pp. 319-325) has proposed that an "international index" of
the well-established factors be prepared, but the other leading factor
analysts argue that the definitions of factors are still shifting and that to
freeze any present list would be premature. The difficulty is not that factor
analysis fails to analyze data correctly. The problem is that, with so many
different ways of placing reference factors, it is unlikely that the best
possible system has been found.

Some investigators feel that factor analysis has paid undue attention to
the content of the test item, i.e., to whether it deals with words, numbers,
forms, or other symbols. Content groupings are of course to be found in tests
using different content. A more fundamental problem, however, is the
organization of mental process. Thurstone distinguished three such processes,
among them reasoning and fluency. Meili (1946, 1955) found evidence for
factors that include complexity (application of such ideas as order),
plasticity (restructuring, as in Block Design or Hidden Figures), and
integration (as in a titling task). These may be subdivisions of Thurstone’s
reasoning factor. Attempts to study process are growing in number but are not
near to final results.
A tentative three-way organization of intellectual tasks has recently been
suggested by J. P. Guilford, as an outcome of a long series of studies of
high-level intellectual performance. Guilford (1957) distinguishes five types
of mental operation:

Memory—retention of information
Cognition—recognizing patterns, facts, etc.
Convergent thinking—proceeding from information to a specific “right
answer”
Divergent thinking—proceeding from information to a variety of adequate
solutions (as in giving titles to fit a plot)
Evaluation—decisions about the goodness or appropriateness of ideas (e.g.,
judging which problems are significant)

Tasks within each of these categories can be classified with respect to
“content” and “product.” The content categories are figural (directly
perceived objects, events, drawings, etc.), symbolic (letters, numbers, etc.),
semantic (verbal), and “behavioral” (interpretation of human behavior).
The six kinds of “products” distinguished by Guilford are units of
information, classes of units, relations between units, systems of information,
transformations, and implications. Since there are five operations, four
content categories, and six products, there are 120 different combinations.
Each combination represents a




type of task which is or can be represented in an intellectual test, according 
to Guilford. For example, the common verbal comprehension tests fit into 
cognition of semantic units of information. A test which asks the subject to 
find the name of a sport concealed in the sentence "He chose a Mongol for 
his bride" is classified as convergent thinking—symbolic—transformation. 
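
The three-way classification can be made concrete by enumerating it. The short Python sketch below is only an illustration and is not part of Guilford's own presentation: the category labels are transcribed from the lists above, and the function name `guilford_cells` is a hypothetical helper introduced here. It shows that five operations, four contents, and six products yield 120 cells, and that the two examples just given occupy single cells.

```python
from itertools import product

# Category labels transcribed from the text; the spelling of each label
# is an editorial choice made for this illustration.
OPERATIONS = ["memory", "cognition", "convergent thinking",
              "divergent thinking", "evaluation"]
CONTENTS = ["figural", "symbolic", "semantic", "behavioral"]
PRODUCTS = ["units", "classes", "relations", "systems",
            "transformations", "implications"]


def guilford_cells():
    """Return every (operation, content, product) combination."""
    return list(product(OPERATIONS, CONTENTS, PRODUCTS))


cells = guilford_cells()
print(len(cells))  # 5 * 4 * 6 = 120

# The two examples in the text each occupy one cell of the scheme.
verbal_comprehension = ("cognition", "semantic", "units")
hidden_sport_name = ("convergent thinking", "symbolic", "transformations")
print(verbal_comprehension in cells, hidden_sport_name in cells)
```

Laying the scheme out this way makes plain why the number of cells grows so quickly: adding a single category on any one axis adds a whole slab of new cells.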

Guilford’s system is still undergoing development, and it is not yet clear 
how well his categories reproduce the empirical relations found through 
factor analysis. The system emphasizes the distinction between test content 
and test process and is therefore an advance over the Thurstone explora- 
tions in which processes (fluency, reasoning) were mixed indiscriminately 
with content (verbal, spatial) factors.

The striking thing about Guilford's system, apart from its bold break with 
tradition, is the vast number of ability factors he requires. His system has 
over 120 cells, of which perhaps 50 have been matched with tests. It be- 
gins to be clear that we will never again have a list of a few simple primary 
abilities. According to Guilford (1957, p. 20): 


The obvious implication for intelligence testing is that the trend
toward the multiple-score approach and the enlightened composite-
score approach should be accelerated. The single, somewhat haphazardly
composed, score has worked well; perhaps too well, hence the
unwarranted complacency regarding it. It would seem that we now
have information that should make possible a considerable advance in
refinements of measurement of intelligence. If the apparent complexity
implied is appalling, what seems to be needed is the courage to face
reality. If the next steps do not seem to be clear, then the cure is more
knowledge—knowledge concerning the whole list of intellectual factors,
their relations to complex mental functioning, and their relations to
everyday behavior.


13. Compare Meili's four factors to Guilford's major factors.
14. Where, in Guilford's system, do Thurstone's V, S, and W factors appear?


A FACTOR ANALYSIS OF THE WECHSLER SCALE 

As a final example of factor analysis in test interpretation, we turn to a
study of the Wechsler subtests (P. C. Davis, 1956). This analysis illustrates
the modern technique of using “reference tests” to define factors.

Davis gave Form I of the Wechsler-Bellevue (very similar to [...]) to
202 eighth-graders in Seattle. He wanted to learn as much as possible about
subtest meanings and believed that group factors would be found if the
obvious general factor among the subtests was broken up. He predicted the
presence of particular factors, and for each such factor he introduced one
or two reference tests into his battery to aid in rotation and interpretation.
A reference test is one regarded as a fairly pure measure of a factor. The
predicted factors and their reference tests were as follows:


Verbal comprehension. Two tests (Nos. 9 and 10 in Table 33). One 
called for choice of a correct synonym, one called for writing of defini- 
tions. 

Numerical facility. Numerical Operations test (4; see Figure 44) calls 
for simple addition, division, etc., under time pressure. 

Perceptual speed. Test (3) requires the pupil to match a figure of 
strange form with whichever of five other figures is exactly like it. 

Visualization. Mechanical principles (6). This test adapted from Air 
Force research resembles the Bennett TMC. Visualization is a factor com- 
monly identified in spatial tests (see p. 279), requiring understanding of 
movements of objects. Test 6 is influenced by both visualization and 
mechanical experience (Guilford, 1947, p. 894) and therefore is a 
rather poor reference test. 

Arithmetic Reasoning (5). Verbally stated problems making little de- 
mand on computation are administered as a highly speeded test. 

Mechanical information. The test (7) asks about a variety of tools and
machinery.

In addition, Davis included other conveniently available scores: age, an
Otis group test, and a test of current scientific information. Finally, he
adapted parallel forms of three Wechsler subtests for group administration.

The correlations were almost all positive. Davis found ten factors, although
he had suggested six initially. He rotated them to obtain a simple
structure, that is, a pattern in which each test has appreciable loadings on
few factors, and each factor is found in only a limited group of tests. The
factors are listed in Table 33. (All loadings lower than .30 are omitted to
reduce confusion.) Specific factor loadings for the Wechsler tests were
calculated by the present writer, using a rough estimate of error variance. (For
the other tests, error and specific factors cannot be separated because their
reliabilities are unknown.)

Let us note that this is a factor analysis of the Wechsler, not the only
possible analysis. A somewhat different structure might occur in a sample of
a different kind. A different investigator might choose a slightly different
rotation. Indeed, a different selection of reference tests might yield other
factors. But these differences should be relatively small with a large sample
and large number of measures.

Davis' factors are as follows:¹

¹ The writer has given some factors names different from Davis', for the sake
of coördination with other analyses in this book.
TABLE 33. Factor Analysis of Wechsler Scores with Reference Variables

(Loadings are listed in the order printed; the column headings naming the
factors are illegible in this copy. A ? marks an illegible entry.)

Tests                          Loadings
 1. Age                        —.32  ?
 2. Otis Beta                  .54  .30  ?
 3. Perceptual Speed           .30  .52*
 4. Numerical Operations       .74*  .31
 5. Arithmetic Reasoning       .44*  .32
 6. Mechanical Principles      .39  .64*
 7. Mechanical Information     .44*
 8. Science Information        ?  ?  .32
 9. Synonyms                   .75*  .31
10. Word Definition            .80*  ?
11. Information—group          .61  .44
12. Comprehension—group        .38  .62
13. Similarities—group         .32  .44  .38  .65
                               .46
14. W Information              ?  .56  .62
15. W Comprehension            .33  .52
16. W Digit Span               .34  .37
17. W Arithmetic               .57  .34  .36  .32  .54
18. W Similarities             .31  .65  .45
19. W Vocabulary               .60  ?
                               ?
20. W Picture Arrangement      ?  .58
21. W Picture Completion       .38  .40
22. W Block Design             .41  .30  .38  .44  .56
23. W Object Assembly          .4?  .34  ?
24. W Digit Symbol             .52  .37  .30

Source: Based on unpublished data supplied by Paul C. Davis. Asterisks indicate
reference tests.



V—The first hypothesized factor, verbal comprehension, appears as defined by
reference tests 9 and 10. Wechsler Vocabulary and Otis Beta are especially
loaded with this factor. Three of the Wechsler “Verbal” tests, however,
have loadings below .30.

VPS—This factor appears in both of the arithmetic reasoning tests, in
Comprehension, and in Group Similarities. It might be titled verbal
problem solving.

NV—Nearly all the tests requiring thinking about numbers or objects
have moderate loadings on this factor. The common factor seems to involve
some sort of nonverbal reasoning. This factor appeared instead of
the hypothesized factor of mechanical information. The test of mechanical
information did not have much in common with the Wechsler tests.

N—This factor has only small loadings on tests other than the refer- 
ence test, Numerical Operations. The high loading of Digit Symbol must 
be due to the marked speeding of both tests, rather than to the fact that 
both tests involve numbers. We call the factor numerical speed. 

I—This is an information or general education factor. 

P—Perceptual speed is identified by the reference test. It appears in 
three of the Wechsler Performance tests. Loadings in two unspeeded 
verbal tests (9 and 13) seem to contradict the interpretation, but the load- 
ings might result from sampling error. 

Vz— This factor is present in the mechanical comprehension test, Pic- 
ture Completion, Block Design, and Object Assembly. It can be only 
vaguely interpreted as some sort of spatial or visualization ability. 

B—This is a factor defined only by similarity items. If only one of the
two Similarities tests had been used, it would have been a specific factor.
Here is an example of changing a specific factor to a “group factor” by
bringing a similar test into the battery.

F—This is found in Group Comprehension, Digit Span, and Block Design.
It is a minor factor having no reference test, and any interpretation
of it is speculative.

P—This unanticipated factor links Picture Arrangement, Word Definition,
Otis Beta, and Age. It is an uninterpretable factor having something
to do with reasoning or education.

All Wechsler tests save Arithmetic have specific loadings over .30. The
notable specific factors are found in Comprehension, Picture Completion,
Object Assembly, and Digit Symbol.
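
The arithmetic behind a specific loading can be sketched briefly. The Python fragment below is an illustration under stated assumptions, not Davis' or the writer's actual computation: it assumes uncorrelated factors, so that squared loadings add to the common-factor variance (the communality), and it takes the reliability coefficient as the estimate of non-error variance. The function name `variance_composition` and the sample loadings and reliability are hypothetical.

```python
import math


def variance_composition(common_loadings, reliability):
    """Split a test's unit variance into common, specific, and error parts.

    Assumes uncorrelated (orthogonal) factors, so squared loadings add.
    """
    communality = sum(loading ** 2 for loading in common_loadings)
    specific = reliability - communality  # reliable variance not shared
    if specific < 0:
        raise ValueError("communality exceeds the stated reliability")
    return {
        "common": communality,
        "specific": specific,
        "error": 1.0 - reliability,
        "specific_loading": math.sqrt(specific),
    }


# Hypothetical test: loadings of .50 and .40 on two common factors,
# reliability .80.
parts = variance_composition([0.50, 0.40], reliability=0.80)
for name, value in parts.items():
    print(f"{name}: {value:.2f}")
```

With these invented figures, 41 percent of the variance is common, 39 percent specific, and 20 percent error, which is why a test can load only moderately on every common factor and still carry a sizable specific loading.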

What, now, have we learned about the Wechsler test? 

• That if we break up the Verbal, Performance, and Full scale scores,
we can find a large number of different abilities within the test. There is no
reason to think that Davis’ ten factors plus four unique factors constitute
the most refined subdivision possible.

• That Wechsler subtests very rarely correspond to psychologically
simple abilities. No Wechsler test is anywhere near to a pure measure of a
commonly accepted reference factor.

• That it appears possible to estimate individual scores on some factors
from appropriate combinations of Wechsler subtests. One could obtain
moderately dependable measures of Verbal Comprehension, Visualization,
Numerical Speed, Perceptual Speed, and Verbal Problem Solving as distinct
abilities. The other factors are not reliably measured by the Wechsler
subtests.

• That the Wechsler scores include a good deal of information about
individual differences not observed in the group tests. The Performance
subtests in particular resisted description in terms of factors common to
other tests. Moreover, the subtests are to an appreciable degree distinct
from each other.

What facts about factor analysis are illustrated in this study? 

• That a good deal of variance, probably representing complex integrative
processes, usually remains in specific factors.

• That different interpreters may disagree as to the psychological meaning
of a factor, but that such disagreement is reduced when a factor is
marked by a well-understood reference test.

• That the minor factors in a study are usually difficult to interpret. To
delineate them clearly it is necessary to include reference variables for
these factors in a further study.

15. What is the factorial composition of Otis Beta, assuming a reliability
of .90?

16. What does the factor analysis suggest as to the best subtests to use in
a short form of the Wechsler?

17. Davis suggests that if only five tests are used in the Verbal scale,
Comprehension might well be omitted. His reason is that it has low loadings
on numerous factors. What reasons might justify keeping it in the scale?

18. Davis names factor VPS “general reasoning.” Why is this name open to
question?


Suggested Readings

Schutz, Richard E. Patterns of personal problems of adolescent girls.
J. educ. Psychol., 1958, 49, 1-5.
A factor analysis of a personality questionnaire shows how empirical results
lead to a different organization of the scores from that arrived at by
classifying items according to apparent content.

Vernon, Philip E. Mental faculties and factors, and Landmarks in the
development of factor analysis. The structure of human abilities. New York:
Wiley, 1950. Pp. 1-24.
Vernon shows by simple calculations how a factor analysis is performed,
warns against common misinterpretations, and reviews several of the most
important analyses of abilities.
Differential Abilities in Guidance


THURSTONE intended his Tests of Primary Mental Abilities to be used in 
guidance, hoping that the person's pattern of abilities would indicate the 
courses and jobs where he could expect greatest success. We are now ready 
to examine how validly such patterns can be interpreted, drawing on evi- 
dence regarding the differential batteries whose validity has been most 
thoroughly tested. We begin by describing two of them, the Differential 
Aptitude Tests (DAT) and the General Aptitude Test Battery (GATB). 


THE DIFFERENTIAL APTITUDE TESTS 


The DAT battery was published in 1947, primarily for high-school counseling.
The eight tests measure aptitudes which previous research had suggested
as important in guidance. Among the tests are a modification of the TMC,
a clerical aptitude test, a spelling test, and a verbal reasoning test. This
partial list makes it clear that the DAT is quite different from the PMA
battery. No attempt is made to isolate simple, pure abilities. Instead, the
tests aim to measure complex abilities which have a fairly direct relation to
job families and curricula. Measures of proficiency are included because of
their predictive value.

The tests require six to thirty minutes of working time. With the addition
of time for directions, three sessions of eighty minutes each are required for
the battery. Except for the Clerical test, the tests are essentially unspeeded.
Items from each of the tests except Mechanical Reasoning and Verbal
Reasoning are presented in Figure 47. (For MR and VR, see pp. 40, 235.)
The publication of this integrated collection marked an important forward
step in aptitude testing. The counselor desiring tests of this nature
previously had had to make up his own collection, using tests standardized and
validated on different samples. Interpretation of profiles was therefore
inexact at best. Percentile conversions for all DAT scores have been
calculated on the same sample, so that profile shapes are meaningful. Moreover,
the tests have been matched in difficulty so that all of them can be applied
satisfactorily to the same subjects.

FIG. 47. Items from the Differential Aptitude Tests. (Reproduced by
permission of The Psychological Corporation.)

NUMERICAL ABILITY. Add: 393, 4058, 3790, 67. Answer: A 7908; B 8608;
C 8898; D 8908; E none of these.
ABSTRACT REASONING. Which figure is next in the series? [Figures not
reproduced.]
CLERICAL SPEED AND ACCURACY. [Matching of an underlined symbol at left
with the same symbol at right; items not reproduced.]
SPELLING. Which words are incorrectly spelled? apointed, commission,
visinity.
SENTENCES. Which parts of the sentence are incorrect in grammar,
punctuation, or spelling? [Items not reproduced.]

The intercorrelations and reliabilities of the tests are presented in
Table 34. Reliabilities are split-half coefficients, except for the speeded
Clerical test, where a between-forms coefficient was used. These data are for
ninth-grade boys. It is evident that the tests measure with adequate precision.
Second, we may note that the tests, except for Clerical, involve a general
factor. Third, and of great importance, the correlations between tests are
much lower than their reliabilities. This assures that each test is independent
of the others to a substantial degree.

TABLE 34. Intercorrelations and Reliabilities of DAT Scores

            VR    NA    AR    SR    MR    Clerical  Spelling  Sentences
VR          .88
NA          .50   .88
AR          .51   .49   .86
SR          .35   .35   .49   .92
MR          .44   .25   .48   .3?   .85
Clerical    ?     .08   .10   .05   .04   .83
Spelling    .48   .36   .25   .?4   .16   .14       .9?
Sentences   .53   .43   .36   .23   .26   ?         .59       .86

Reliabilities appear on the diagonal. A ? marks a digit or entry illegible in
this copy.
Source: Bennett et al., 1947, pp. C-5, C-10.
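
The comparison of intercorrelations with reliabilities can be put in numbers with the familiar correction for attenuation, which estimates what the correlation between two tests would be if both were perfectly reliable. The sketch below is an illustration added here, not a computation from the DAT manual; the function name `disattenuated_r` is hypothetical, and the figures used are the VR and NA entries of Table 34 (correlation .50, reliabilities .88 and .88).

```python
import math


def disattenuated_r(r_xy, rel_x, rel_y):
    """Correlation corrected for unreliability in both tests."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Verbal Reasoning and Numerical Ability, ninth-grade boys (Table 34).
corrected = disattenuated_r(0.50, 0.88, 0.88)
print(round(corrected, 2))  # 0.57
```

Even after correction the correlation stays far below 1.00, which is the sense in which the two tests measure substantially different things rather than the same thing obscured by error.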

In order to emphasize the concept of multiple abilities, as distinct from the
single composite ability commonly measured in previous tests, the DAT
originally provided no total or general score. The authors later responded to
the counselor's demand for an overall predictor by developing norms for the
combination VR + NA. This composite serves the same purpose as the
group tests of general ability or scholastic aptitude in common use.


1. The manual suggests that the DAT may be given in two, three, or six
sessions, adjusting the length of session appropriately. Which arrangement
would you consider wisest?

2. Prepare a composition diagram like Figure 37 to show the breakdown into
common and independent elements of these pairs of tests.
a. Verbal-Abstract
b. Numerical-Clerical

3. If a person being counseled has been tested with the Wechsler, which of the
Differential Aptitude Tests would add the most useful supplementary
information?

4. In what high-school subjects would you expect the Space Relations score to
predict success better than the Abstract Reasoning score?
THE GENERAL APTITUDE TEST BATTERY


In marked contrast to the DAT in form and function is the GATB. This
battery was produced by the U.S. Employment Service and is used throughout
the country for guiding persons seeking work. The construction of the
battery was strongly influenced both by Thurstone's factor-analytic studies and
by three decades of research on job performance. Several of the tests are
descended from the pioneer Minnesota series of vocational aptitude tests,
which date back to the 1920's.

The USES tests are given only through state employment services. The
tests are often given to high-school juniors and seniors under a coöperative
plan which makes the results available to both the high-school counselor
and the employment service. Versions of the tests are now being prepared
in at least 27 foreign countries.


The employment services are primarily concerned with guiding the person 
into suitable work. There are thousands of jobs in the modern industrial 
world, each having its own aptitude requirements. When an employer asks 
for referrals of potential employees, he wants applicants who are likely to 
succeed. The USES, working with state agencies, therefore conducts studies 
of the psychological characteristics of particular jobs and accumulates in- 
formation on the meaning of test scores. Dvorak (1956) mentions the follow- 
ing occupations having been studied during a single year: assembler of dry- 
cell batteries, aircraft electrician, teacher, X-ray technician, nurse aid, 
sheet-metal worker, baker, cook, spot welder, comptometer operator, corn
[...] machine fixer, and fruit packer. Predic- [...] the academic and
reasoning abilities [...] in this book.


G—General reasoning ability (a composite of tests titled Vocabulary,
Three-Dimensional Space, and Arithmetic Reasoning)
V—Verbal aptitude (Vocabulary)
N—Numerical aptitude (Computation, Arithmetic Reasoning)
S—Spatial aptitude (Three-Dimensional Space)
P—Form Perception (Tool Matching, Form Matching)
Q—Clerical perception (Name Comparison)
K—Motor coördination (Mark Making)
F—Finger dexterity (Assemble, Disassemble)
M—Manual dexterity (Place, Turn)


(An earlier form had an additional measure of Eye-Hand Coördination or
Aiming [A], and factor K was referred to as T.)

We can skip over Vocabulary, Arithmetic Reasoning, and Computation 
without further description. The Space test is much like the DAT spatial 
test. Name Comparison, like DAT-Clerical, requires quick checking to detect 
discrepancies between two lists. The USES version gives two lists of names 
of business firms, identical except for errors of style and spelling. This tech- 
nique of name comparison was invented for the Minnesota Clerical Aptitude 
Test, one of the earliest successful special aptitude tests. 

Tool Matching calls for rapid visual comparison of pictures of tools,
alike save for differences in shading. The only reason for showing tools
rather than abstract forms is to increase the subject's interest. Form
Matching is a pencil-paper adaptation of a formboard used in the Minnesota
studies in which dozens of irregular shapes were cut out of a board. The subject
was to fit each shape into the correct hole (see Figure 48). In the USES
test, the shapes are printed in two different arrangements, and the subject
must match identical forms. The test appears much like Figure 48, save
that the shapes are larger. Changing from a formboard to a printed test
undoubtedly simplified the factor composition of the test by eliminating
dexterity from the score, and so made it more interpretable as well as easier
to administer.

FIG. 48. Minnesota Spatial Relations Formboard. (Courtesy Educational Test
Bureau.)


Mark Making, a psychomotor test, is likewise designed to meet the demands
of a program which tests a million people each year. The subject is asked
only to make a prescribed mark in each square, filling as many squares as
he can in sixty seconds.

The Place and Turn tests are derived from the Minnesota Rate of
Manipulation Test. Forty-eight pegs are placed in a pegboard. A second board
with rows of holes is provided, and the subject transfers the pegs from one
board to another as fast as possible. In the Turn test, he inverts each peg
while transferring it.

The tests named Assemble and Disassemble call for finer coördination,
using both hands. A board contains fifty holes. The person is to fit a rivet
and washer [...]


[...] efficiency that has never been exceeded. The pencil-paper tests
are close to six minutes each. [...] working time, but several minutes are
[...] practice. The entire battery [...] English. The [...] tests are so
designed that each [...]

TABLE 35. Intercorrelations and Reliabilities of GATB Scores for High-
School Seniors

                          G     V     N     S     P     Q     K     F     M
G—General                 .85
V—Verbal                  ?     .85
N—Numerical               ?     .42   .82
S—Spatial                 ?     ?     .34   .8?
P—Form Perception         .43   .34   .42   .48   .72
Q—Clerical Perception     .35   .29   .42   .26   .66   .74
K—Motor Coördination      —.04  .13   .06   —.03  .29   .29   .76
F—Finger Dexterity        —.05  ?.03  ?.03  .01   .27   .29   ?     .?5
M—Manual Dexterity        —.06  .06   .01   ?     ?     ?     ?     ?     ?

Reliabilities appear on the diagonal. A ? marks a sign, digit, or entry
illegible in this copy.
With its access to workers in all areas of the country, all types of industry
and agriculture, and most occupational levels, the USES was able to obtain 
a highly representative normative sample. Four thousand cases were drawn 
from the records on hand to form a group in which all occupational, sex, and 
age groups were properly represented in proportion to census data. Scores 
on the factors are expressed in standard-score form, with a mean of 100 and 
s.d. of 20. 
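
The conversion to this standard-score scale is a simple linear one. The sketch below illustrates it in Python; it is an editorial example rather than the USES scoring procedure, and the function name `gatb_standard_score` together with the raw score, norm mean, and norm standard deviation are hypothetical values chosen only to show the arithmetic.

```python
def gatb_standard_score(raw, norm_mean, norm_sd,
                        scale_mean=100.0, scale_sd=20.0):
    """Express a raw score on a scale with mean 100 and s.d. 20.

    norm_mean and norm_sd describe the raw-score distribution of the
    normative sample (hypothetical values in the example below).
    """
    z = (raw - norm_mean) / norm_sd  # distance from the mean in s.d. units
    return scale_mean + scale_sd * z


# A raw score 1.5 s.d. above the normative mean.
print(gatb_standard_score(raw=62, norm_mean=50, norm_sd=8))  # 130.0
```

On this scale a score of 120 lies one standard deviation above the mean of the general working population and a score of 80 one standard deviation below it, whatever the aptitude in question.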

Correlational data for the GATB are presented in Table 35. (These data 
are selected from several tables in the technical manual for the test and are 
not based on the same sample.) We note the usual common factor running 
through the pencil-paper tests, and another factor linking the psychomotor 
tests. In general, the test intercorrelations are low enough to give some
promise of meaningful separation of aptitudes.


5. Compare the reliabilities of the DAT and GATB. How much was sacrificed by the
use of short tests in GATB? What would the reliability of the S score be if the
test were extended from six minutes to thirty minutes?

6. How do you account for the overlap of scores P and Q, which seem to involve
neither reasoning nor dexterity, with the remainder of the battery?

7. Are local norms or national norms most relevant in occupational guidance?

8. The median coefficient of stability for GATB for high-school students is .81, but
for adult applicants at employment service offices it is .89. Account for this
difference. (Time between tests is short in both cases.)

9. Table 36 indicates stability of GATB scores over intervals of several years.
Which aptitudes are stable enough to be used confidently for ninth-grade
counseling? Which aptitudes appear to stabilize late in high school?


TABLE 36. Stability of GATB Scores

                           Correlation with 12th-Grade Scores
                           of Tests Given in Grade
                           8       9       10      11
                           N=53    N=61    N=61    N=53

G—General                  .75     .82     .80     .84
V—Verbal                   .70     .76     .73     .82
N—Numerical                .76     .77     ?       .85
S—Spatial                  .76     .86     ?       ?
P—Form Perception          .61     ?       ?       ?
Q—Clerical Perception      .77     ?       ?       ?
A—Aiming                   .55     .58     ?       .64
T—Motor Speed              .59     ?       ?       ?
F—Finger Dexterity         .59     ?       ?       ?
M—Manual Dexterity         .65     .65     ?       .73

A ? marks an entry illegible in this copy.
Source: Unpublished results supplied by Dr. Beatrice Dvorak.


Relation of DAT to GATB 


One study has applied both DAT and GATB to the same high-school seniors,
and the intercorrelations (Table 37) shed light on both tests. Each DAT
score has its highest correlation with the corresponding GATB factor, except
that DAT-VR and DAT-NA have higher correlations with GATB-General
than with GATB-V and -N. The general factor has substantial influence in
every DAT score except Clerical, which correlates with all the GATB speeded
tests.

TABLE 37. Correlation of DAT and GATB Scores

                                      GATB Scores
              G        V       N       S        P      Q      T      F       M
              General  Verbal  Number  Spatial  Form   Cler.  Motor  Finger  Manual
DAT Scores                                      Perc.  Perc.  Speed  Dext.   Dext.

Verbal        .78      .72     .54     .54      .21    .4?    .29    .20     ?
Spelling      .66      .66     .57     .21      .03    .51    .32    .08     ?
Sentences     .74      .75     .56     .36      .05    .33    .33    ?       ?
Numerical     .66      .52     .62     .32      .01    .22    .27    .13     ?
Abstract      .68      .48     .45     .56      .14    .26    .21    ?       ?
Space         .59      .49     .24     .72      .23    .22    ?      .35     ?
Mechanical    .62      .56     .25     .68      .13    .09    .24    .39     ?
Clerical      .25      .38     .33     .07      .46    .53    ?      ?       ?

The values given are those for high-school boys; correlations for girls are
similar, but generally lower. Correlations over .50 are in boldface type in the
original. A ? marks a digit or entry illegible in this copy.
Source: Guide to the Use of GATB, 1958, p. L-1.


The GATB factors P, F, and M measure aptitudes not covered in the DAT
battery. DAT-Mechanical Reasoning has no counterpart in the GATB,
though it overlaps G and S to a considerable degree. DAT-Spelling and
Sentences overlap considerably with Verbal Reasoning.
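
One rough way to gauge such overlap is to square the correlation, which gives the proportion of variance two scores share. The Python sketch below applies this to three legible entries of Table 37 (boys). It is an editorial illustration that ignores unreliability in either score, and the function name `shared_variance` is hypothetical.

```python
def shared_variance(r):
    """Proportion of variance two scores have in common (r squared)."""
    return r * r


# Correlations read from Table 37 (high-school boys).
pairs = {
    "DAT-Verbal with GATB-G": 0.78,
    "DAT-Numerical with GATB-N": 0.62,
    "DAT-Mechanical with GATB-S": 0.68,
}
for label, r in pairs.items():
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.2f}")
```

By this reckoning DAT-Mechanical Reasoning shares a little under half its variance with GATB-S, so the overlap is considerable while a good deal of each score remains its own.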


10. For what types of guidance does the content of GATB seem more useful than
that of DAT? For what types is it less useful?

11. What do the correlations for DAT-Clerical tell about its meaning?

12. DAT and GATB spatial tests correlate .72, but each correlates only .50 with
PMA-Spatial. How do you account for this?

13. Make composition diagrams to show the overlap and unique content of these
pairs of tests:
a. DAT-NA and GATB-N
b. DAT-MR and GATB-S

14. Why does MR have a large spatial loading here, when a similar test showed
no such factor in Table 30?

SPATIAL ABILITY 


We cannot examine separately the psychological and practical significance
of every factor so far isolated, or even of all the scores in the test batteries
under discussion. We have selected spatial reasoning and mechanical
comprehension as examples for close attention. After reviewing evidence on
these two factors, we shall return to a general discussion of the test batteries.
Psychomotor abilities will be further considered in the next chapter. 

Spatial ability was present in some early nonverbal tests of general ability, 
but it was soon recognized that tests calling for comprehension of form re- 
lationships were not measuring the same thing as tests like Picture Arrange- 
ment which required comprehension of ideas. Early investigators of voca- 
tional aptitudes identified a number of jobs which seemed to require facile 
reasoning about forms, and spatial tests have since played a part in nearly 
all research on vocational aptitude. The DAT manual speaks of Space Rela- 
tions in this way (Bennett et al., 1959, p. 7): 


The Space Relations test is a measure of ability to deal with concrete 
materials through visualization. There are many vocations in which one 
is required to imagine how a specified object would appear if rotated in 
a given way. This ability to manipulate things mentally, to create a 
structure in one’s mind from a plan, is what the test is designed to evalu- 
ate. It is an ability needed in such fields as drafting, dress designing, 
architecture, art, die-making, and decorating, or wherever there is need 


to visualize objects in three dimensions. 


There appear to be several distinct spatial abilities. Comprehending static 
objects (as in Block Counting) seems to involve something quite different 
from visualizing how an object or machine will look after certain movements 
take place (Guilford, 1947, pp. 269-296; Michael et al., 1951). A visualization
factor (Vz) is found in tests such as Binet paper-folding and in some of
Thurstone's tests where the subject must visualize how a figure will look
when rotated.


Validity in Educational Prediction 


One might expect spatial ability to be relevant to high-school courses
such as geometry, shop, and engineering drawing. Validity coefficients for
many schools are available in the DAT manual, some of which are reported
in Table 38. For comparison, coefficients are also given for Numerical Ability
and Abstract Reasoning. The coefficients reported are based on boys, but
results for girls are similar.

Looking first at the correlations for geometry, we see that results from one
[...] seriously. The two White Plains samples [...] in the same year, but
coefficients in one class are strikingly higher than in the other. [...] one
school in a well-defined course [...] validity are hazardous.
In all schools, SR has positive relations with geometry, but NA is a better
predictor. Insofar as we can judge from these coefficients, the contribution of
SR to prediction of geometry is accounted for by its general-factor content.
Other spatial tests show similar results. Though geometry undeniably
requires reasoning about forms, tested spatial ability accounts for little of the
variation in geometry marks. Here again we encounter evidence warning
the test user against trusting his judgment as to what a test is likely to
predict.

Note also, from this example, that the importance of a test cannot be 
judged solely from its correlation with the criterion. Considered alone, spa- 
tial ability has modest validity. Considered alongside other predictors, we 
find that the predictive value of the test is due to its general-factor content. 


TABLE 38. Some Validity Coefficients for Differential Aptitude Tests
Against Course Grades

                                            Time Between    Number    Correlation of Marks with
Course              Grade  Location             Test and Marks  of Cases  Space  Numerical  Abstract

Plane Geometry      10     St. Paul, Minn.      1 year          48        .32    .47        ?
                    10     White Plains, N.Y.   1 year          70        .20    .34        .56
                    10     White Plains, N.Y.   1 year          77        .53    .57        ?
Solid Geometry      12     Baltimore, Md.       1 year          47        .3?    .33        .25
                    12     Hamilton, Ohio       1 semester      42        .18    .61        .16
Art                 8      Yonkers, N.Y.        1 year          471       .20    .23        ?
                    9      Worcester, Mass.     1 semester      44        .34    .41        .1?
Mechanical Drawing  10     Gloucester, Mass.    1 year          46        .02    ?          .43
                    10     Independence, Mo.    1 year          44        .57    .49        .2?
Shop                9      Worcester, Mass.     1 semester      142       .26    .27        ?
                    8      Yonkers, N.Y.        1 year          41        .18    .28        ?
                    10     Independence, Mo.    3 months        42        .07    .06        ?
                    8      Schenectady, N.Y.    1 semester      81        .33    .28        .50

A question mark denotes an entry that is illegible in this copy.

Source: Bennett et al., 1959, pp. 42 f.


The essential question about the practical value of a test is how much it
adds to what other measures can tell.
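
The question of how much a test adds can be made concrete with the multiple correlation for two predictors. The sketch below is an illustration only, not a procedure taken from the DAT manual: the function names are ours, the two validity figures are merely of the order seen in Table 38, and the correlation of .60 between the two predictors is an assumed value.

```python
import math


def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validity of each predictor against the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictor intercorrelation must lie strictly between -1 and 1")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)


def incremental_validity(r_y_base: float, r_y_new: float, r_12: float) -> float:
    """Gain in correlation when the new test is added to the base test."""
    return multiple_r(r_y_base, r_y_new, r_12) - abs(r_y_base)


# Illustrative values (assumptions, not manual data):
# numerical test .57, spatial test .53, intercorrelation .60.
r_numerical, r_spatial, r_between = 0.57, 0.53, 0.60
combined = multiple_r(r_numerical, r_spatial, r_between)
gain = incremental_validity(r_numerical, r_spatial, r_between)
print(f"combined R = {combined:.3f}, gain over numerical alone = {gain:.3f}")
```

Under these assumed figures the spatial test, though it correlates .53 with marks when considered alone, raises the prediction only from .57 to about .62. The more the two predictors overlap, the smaller the gain becomes.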


The remaining coefficients in Table 38 tell the same story: variation from
class to class, generally small positive correlations of SR with the criterion,
equally good correlations for nonspatial tests. These data, and data on other
tests, point to the conclusion that spatial ability does not, per se, predict
success in high-school courses. [...]

A study of college mathematics [...]
subject is to locate in the second picture the "aiming point" toward which
the prow was pointed in the first scene. The second, Spatial Visualization
(Vz), requires the subject to identify how a clock will appear when tilted
and rotated in a sequence of movements described verbally. Hills found
consistent correlations of S with criteria in several mathematics courses for
engineers, coefficients being as high as .55. In courses for physics and
mathematics students at the same level of mathematics, however, S had
negligible validities. Hills also found that the relevance of the factor to a
specific course (e.g., calculus) depends on how the course is taught. Validities


FIG. 49. Guilford-Zimmerman Spatial Orientation items. The subject is to mark
whichever answer shows the position of the boat's prow (represented by the bar)
in relation to the original aiming point (dot). The answers to the three items
are C, B, and E respectively. (Copyright 1947, Sheridan Supply Co., and
reproduced by permission.)

for Vz were much smaller than for S in the engineering sections but were
consistently larger in sections for physics students. S gave a larger number
of substantial validity coefficients than any other of the reasoning factors
tested. Hills' results hint that special abilities may be more valuable as
differential predictors in advanced courses than in high school. Special
abilities contribute little to prediction of overall grade averages, since no
ability save verbal or numerical affects many courses.


15. Give possible explanations for the differences between the two White Plains
samples in Table 38.

16. How can one explain the negligible importance of spatial ability in predicting
geometry?


Occupational Validity


The chief value of spatial tests is in vocational choice and employee
selection. A study of watch repairing, for example, indicates a marked
correspondence between spatial ability and performance, the validity
coefficient being .69 (Bennett et al., 1959, p. 65). Ghiselli's summary of
published reports (1955) shows that spatial relations tests have predictive
validities averaging greater than .30 for either training success or job
proficiency in protective occupations, service occupations, [...], electrical
workers, structural workers, processing workers, operators of complex
machines, and gross manual workers.

[...] information on vocational correlates of spatial ability is provided by
the USES. Table 39 gives a sample of [...], along with data on General, Form
Perception, and Manual Dexterity. These data were gathered on persons working
in or training for the occupation who had already been selected to some extent,
as is shown by the fact that the mean score departs from 100 and the s.d. is
below 20. Only persons very superior in space ability get into engineering
[...] courses. Drill-press operators, at the other end of the scale, are [...]
the below-average workers who remain after those with better aptitudes are
siphoned into other jobs. The validity coefficients for occupations where the
s.d. is low would be much larger if an unselected group had been tested.
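
How much a coefficient shrinks in a preselected group can be estimated with the usual correction for direct restriction of range on the predictor. The sketch below applies that textbook correction; it is not a procedure described in the GATB guide, the function name is ours, and the unrestricted s.d. of 20 and the sample figures are assumptions chosen for illustration.

```python
import math


def correct_for_range_restriction(r_restricted: float,
                                  sd_restricted: float,
                                  sd_unrestricted: float) -> float:
    """Estimate the validity in an unselected group.

    Standard correction for direct selection on the predictor:
    r_c = r*k / sqrt(1 - r^2 + r^2 * k^2), with k = sd_unrestricted / sd_restricted.
    """
    if sd_restricted <= 0 or sd_unrestricted <= 0:
        raise ValueError("standard deviations must be positive")
    k = sd_unrestricted / sd_restricted
    r2 = r_restricted ** 2
    return (r_restricted * k) / math.sqrt(1 - r2 + r2 * k ** 2)


# Hypothetical occupation: observed validity .30 in a selected group whose
# s.d. is 15, against an assumed s.d. of 20 among applicants in general.
estimate = correct_for_range_restriction(0.30, 15.0, 20.0)
print(f"estimated validity in an unselected group: {estimate:.3f}")
```

The correction assumes that selection operated directly on the test score and that the regression is linear. Where workers were selected on other grounds, the estimate is only a rough guide.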

Spatial ability is also important in several of these occupations. Both
general and spatial ability contribute to success as draftsman or
tabulating-machine operator; dentists, engineers, and machinists need form
perception in addition. Careful distinction between aptitudes is important for
job assignments. Although S and P are both, in a sense, "spatial," S is
relevant to dentistry lecture courses while P is not. For bomb-fuse assemblers,
the [...] perception tested by P is much more important than the reasoning
tested by S. The radio-tube mounter is likewise engaged in assembly of small
parts, but his success depends on dexterity, not on S or P.


Few of the correlations in Table 39 are large. Spatial ability alone does not


TABLE 39. Validity of GATB-S Against Occupational Criteria

                             Number                         Spatial Aptitude   Correlations
Occupation                   of Cases  Criterion            Mean    s.d.       S      G      P      M

Dentist                      96        Lecture grades       132     14         .29    .24    -.02   ?
                             89        Laboratory grades    -       -          .33    .43    ?      ?
Engineer                     150       School grades        134     15         ?      .42    ?      ?
Draftsman                    40        Ratings              126     12         .32    .42    .06    .08
Machinist                    71        Ratings              114     18         .37    .29    ?      ?
Tabulating-machine operator  203       Ratings              106     18         .20    .34    .10    ?
Bomb-fuse parts assembler    90        Ratings              102     15         .12    .21    .33    ?
Mounter (radio tubes)        100       Production records   101     14         -.02   .03    -.02   .32
Upholsterer                  49        Ratings              ?       17         .43    .24    .25    ?
Poultry laborer              72        Ratings              ?       16         .03    .24    .09    ?
Drill-press operator         31        Production records   ?       18         .05    .32    ?      ?

Values in boldface are significant (P < .05). A question mark denotes an entry
that is illegible in this copy.

Source: Guide to the Use of GATB, 1958, III J.
account for success in any of these jobs. Taking all the aptitudes into account
simultaneously, however, can greatly improve employment decisions. In
Chapter 12 we shall explain some of the procedures used to combine
aptitudes into a selection formula.


17. Explain why the validities of the GATB tests are different for the two dentistry
criteria.


MECHANICAL COMPREHENSION 


We have previously discussed the Bennett TMC, which is the prototype for 
the Mechanical Reasoning Test of the DAT battery. No mechanical compre- 
hension test was included in the USES battery, on the assumption that other 
tests in the battery cover much of what such a test would measure. As Table
37 showed, DAT-MR correlates about .60 with tests of G, V, and S. Factor
analyses of an Air Force test patterned after the TMC, however, indicate
that about 35 percent of its variance comes from Mechanical Experience,
25 percent from Visualization, and only 12 percent from G, V, and S
combined (Guilford, 1947, pp. 336-339). These reports are less contradictory
than they perhaps appear, since each analysis is based on different test
batteries and statistical procedures. The Air Force analysis is the more
satisfactory, being based on far more data and reporting correlations with
factors rather than with single tests.
The validity coefficients for mechanical comprehension against high-school
marks run a bit lower than coefficients for other abilities. In the DAT,
the median correlation of MR with science grades for boys is .40, compared
to VR, .54; NA, .52; AR, .42; and SR, .34. (See also Table 11.) Quite similar
results are obtained for the Multiple Aptitude Tests, another high-school
battery.

Adaptations of the Bennett test have frequently predicted success in civilian
and military technical specialties (Bennett and Fear, 1943). The British
Army found that a form of the Bennett test had a validity of .59 for selecting
truck drivers; no other test was nearly so good (Vernon and Parry, 1949, p.
230). Among the average validity coefficients calculated by Ghiselli (1955)
on the basis of the published literature, mechanical comprehension had the
following notable validities for either training or job-performance criteria:


.50 to .59  machining workers, bench workers, and assemblers
.40 to .49  protective occupations, electrical workers, processing workers,
            complex-machine operators, inspectors
.30 to .39  repairmen, welders, vehicle operators, structural workers

Of particular interest is an Air Force study in which a factor analysis of
pilot success was made. Out of 26 independent factors considered, the two
most significant for pilot success were Spatial and Mechanical Experience
(followed closely by Integration, Visualization, Psychomotor Coordination,
and Pilot Interests; Guilford, 1947, p. 843). The Mechanical Principles test
had a validity of about .35 as a predictor of pilot success.

A second type of mechanical test requires subjects to identify pictures of
tools and is thus a measure of acquaintance rather than understanding. A
recent test of this type is illustrated in Figure 50. The subject is to find the
lettered picture that goes with each numbered picture. Knowledge about a
field may be regarded as an indication of interest in it, if people have more
or less equal opportunities to get such information. Verbal tests of
information about machinery, medicine, current events, sports, etc., may be
[...] in vocational prediction.


FIG. 50. Part of the Mellenbruch test. (Copyright 1957 by Psyc[...]


The U.S. Employment Service has done much to develop trade tests for [...]
employment center, eliminates those who might otherwise be shipped
across the country to a plant where skilled men are needed. Trade tests are
also used in military classification to check whether men are qualified in the
trades where they claim civilian experience. In the British Army such tests,
because of their reliability, were sometimes more dependable bases for
assigning men than records made in training courses (Vernon and Parry,
1949, p. 244).

Trade questions are selected to cover job processes and tools. Questions 
that would be unfair because of regional differences in methods of work or 
vocabulary are eliminated. To check item validity, three criterion groups 
are tested: expert workers, beginners in the trade, and workers in closely re- 
lated trades. The items which discriminate these groups are retained. 
Items from several tests are (Stead et al., 1940): 


(Carpenter) What do you mean by a "shore" in carpentry? Ans. Upright brace.
(Plumber) What are the two most commonly used methods of testing plumbing
systems? Ans. Water, smoke, peppermint, air (any two).
(Asbestos worker) In stitching canvas covering over pipes, where is the seam
run? Ans. Out of sight, back or top of pipe (either).
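
The item-validity check with three criterion groups can be sketched as a comparison of pass rates. The code below is only an illustration of the idea: the function names, the response data, and the minimum gap of .20 between groups are hypothetical and are not taken from Stead et al.

```python
from typing import Dict, List


def pass_rate(responses: List[int]) -> float:
    """Proportion of a group answering the item correctly (1 = correct, 0 = wrong)."""
    if not responses:
        raise ValueError("each criterion group needs at least one response")
    return sum(responses) / len(responses)


def item_discriminates(groups: Dict[str, List[int]], min_gap: float = 0.20) -> bool:
    """Retain an item only if experts pass it clearly more often than both
    beginners in the trade and workers in closely related trades."""
    experts = pass_rate(groups["experts"])
    beginners = pass_rate(groups["beginners"])
    related = pass_rate(groups["related_trades"])
    return (experts - beginners) >= min_gap and (experts - related) >= min_gap


# Hypothetical responses to two trial questions.
good_item = {
    "experts":        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],   # 90 percent pass
    "beginners":      [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],   # 30 percent pass
    "related_trades": [0, 0, 1, 0, 0, 0, 0, 0, 1, 0],   # 20 percent pass
}
weak_item = {
    "experts":        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],   # 80 percent pass
    "beginners":      [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],   # 70 percent pass
    "related_trades": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],   # 80 percent pass
}

print(item_discriminates(good_item))   # kept: experts stand well apart
print(item_discriminates(weak_item))   # dropped: everyone passes about equally
```

A question that nearly everyone in a related trade can answer measures general shop knowledge rather than mastery of the trade, which is why the third criterion group is tested.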


A good trade test discriminates between novices, apprentices, journeymen,
and experts. In Figure 51 we see how a test of engine-lathe operators
functions. Such a distribution of scores permits one to classify a job applicant
with little error; a score of 22 almost certainly indicates a journeyman.


FIG. 51. Scores on a trade test for engine-lathe operators (Burtt, 1942,
p. 493).


18. a. The Purdue Assembly test is designed to include mechanisms using each
important mechanical device: gears, levers, rack-and-pinion, etc. Does such a
test assume that mechanical aptitude or comprehension is a single general
ability, or that it is a group of specific abilities?
b. If the latter theory is true, what implications does it have for selecting
students for training in watch repairing?

19. Boys surpass girls on the Bennett test. How may this finding be explained?

20. Mellenbruch reports validity coefficients ranging from .50 to .60 for his
mechanical aptitude test. The criteria used are teacher's ranking of engineering
drawing trainees (women), experience in mechanical activities, and scores on
the Air Force Mechanical Information Test. What other validity studies are
needed to support his recommendation that those scoring low on the test should
not be hired for mechanical work or should be placed only in routine
mechanical jobs?


THE INTERPRETATION OF APTITUDE PROFILES 


Differential ability tests are used in two ways: for institutional decisions and
for individual decisions (Cronbach and Gleser, 1957). An institutional
decision is one in which a factory, a school, a military organization, or the like
selects and assigns individuals in order to obtain the best total result, i.e., the
greatest possible attainment of institutional goals. This use of the tests rests
primarily on efficient statistical combination of scores rather than on
psychological interpretation. An individual decision is one which seeks to
promote the welfare of one person, considered by himself. In career guidance,
for example, the emphasis must be on psychological interpretation. We shall
concern ourselves here with the use of profile information in individual
decisions, and turn to institutional decisions in Chapter 12.


Perhaps there was once a hope among counselors that a test profile
would permit a definite, final choice of vocation at the time the tests are
given. If this were the case, the counselor and client together could reach a
decision, and the client could rely on the counselor's interpretation of the
tests. Today it is recognized that the client himself must fully understand
the test results, for two reasons.

[...] provide opportunities for him to explore and develop aptitudes and
interests. In an [...] economy, workers change position or change
responsibilities within the same establishment. The engineer in a technical
firm, for example, may become a manager, a salesman, a creative designer, or
an expert [...] specifications. Wise choice requires self-understanding; no
"prescription" filled out by a tenth-grade or freshman-year counselor can
anticipate these subsequent decisions. Test interpretation is only one step in
a long process of self-[...].

Secondly, the client is more likely to accept recommendations which he
understands. The counselor may be convinced that a freshman should get
out of engineering and into advertising. Even though advertising is
consistent with the boy's talents and interests, he may resist or ignore the
recommendation. If he has been visualizing himself as an engineer for years,
such a change of program requires him to alter his entire self-concept and may
seem like an admission of defeat. To accept the new goal requires that he
understand the facts the counselor considers significant. Acquiring a new
self-image requires both factual and emotional learning.

The counselor must decide what meaning may justifiably be drawn
from scores and must at the same time consider how this information is to be
communicated so that it affects the client's conduct.


Limitations on Interpretation


A general ability test or a battery of aptitude measures has definite
predictive value, as we have seen. At the same time, the scores have distinct
limitations which must be remembered.

Profile Shape as a Function of the Norm Group. It is necessary to use test
norms in order to plot a profile, and the choice of norms affects the profile
shape. Profile shape changes when a different norm group is used. The
most common example arises in interpreting mechanical comprehension
scores for girls (see p. 92).

The USES profile is ordinarily plotted against norms for adult workers.
The profile of a (hypothetical) student engineer plotted in the usual manner
(upper profile, Figure 52) draws attention to his superior G, V, and S
abilities, and shows him near the average in dexterity. If instead we use a
standard-score conversion from data on engineering students, his profile
(lower profile, Figure 52) takes on a strikingly different appearance. His
greatest strength, relative to other engineers, is V. In S, he is [...] above
average; he is average in G, and behind the group in dexterity. It is
important to compare the person with the group he will associate with and
compete against rather than with "people-in-general."


FIG. 52. Two GATB profiles for the same student engineer.
Precision of Measurement. When we try to measure several aptitudes in a
short period of time, reliability coefficients often drop to .75 or lower. Even
with high reliability, a retest would show enough change to suggest different
recommendations for a certain number of persons. These random errors,
though present, do not cause much concern when tests are used for
institutional decisions. Even if a test is seriously wrong in 10 percent of the
cases, the decision maker reaches correct conclusions far more often than he
would with other data. If an unintelligent man slips by an Army screening
test, he can be detected later and discharged at no enormous cost. In an
individual decision, however, we cannot be content with a small rate of error.
One error may alter a person's entire life if the test leads him to decide, for
example, not to continue his education.


Suppose it is known that 70 people out of 100 having IQ 110 fail in a
certain profession. The counselor cannot make a clear prediction for Walter,
IQ 110. Perhaps he would do better if tested again. Perhaps other qualities
unknown to us make Walter one of the 30 who would succeed, rather than one
of the 70 who fail. Almost never are psychological tests so valid that a
prediction about a single case is certainly true.

The counselor who [...] reduce its ill effects [...] consistency. If in doubt,
[...] comparable test. He examines his case for special factors such as
language difficulty which might [...] test performance as [...] somewhat, but
not extremely, below [...]. The statement "Walter is at the 32nd percentile
[...]" is almost certainly untrue, in the sense that further data would not
precisely confirm it.

Clients and [...] carefully qualified, the person [...] only portions of it. A
parent, [...] the tester's cautions about what the [...] does not measure, the
possibility of growth or decline in IQ, and the approximate nature of
predictions from it. Instead, the figure itself may stick vividly in mind and
be used as a basis for significant decisions for years to come.

Profile Reliability. Special difficulties are encountered in interpreting
multiscore tests where judgments rest on differences between scores. Such
differences are usually much less reliable than the scores themselves. DAT-VR
and DAT-NA are reliable, for example, but they overlap, and much of the
reliability of each score is due to the overlapping part. When that is
subtracted, the remaining test variance contains a high proportion of error
(Thorndike and Hagen, 1955, p. 178).

The reliability of a difference between two standard scores A and B is
calculated by this formula:

    r_{(A-B)(A-B)} = \frac{r_{AA} + r_{BB} - 2r_{AB}}{2 - 2r_{AB}}

Here r_{AA} and r_{BB} are the reliabilities of the two scores and r_{AB} is the
correlation between them.

When tests have low reliabilities or a high degree of overlap, the difference 
is highly unreliable. Using the data of Table 34, we find that in the DAT the 
reliability of the VR-NA difference is /16. That of VR minus CSA is .82. 
Small differences are generally chance effects. When a difference
becomes twice as large as its standard error, there is only one chance in
twenty that the person is equally good on both tests. We can have substantial
confidence that a retest would confirm such a difference. Table 40 indicates
TABLE 40. Interpretability of Difference Scores

Average           Difference in       Proportion of Subjects Showing Interpretable
Reliability of    T-Score Units       Difference if Test Intercorrelation Is:
Profile Scores    Required for
                  Interpretation      .00      ?        ?        .75

?                 ?                   66       61       53       38
.90               ?                   54       47       38       22
.80               12.5                37       31       21       8
?                 ?                   28       21       13       3

In this table, an interpretable difference is defined as one which would occur
only one time in twenty among persons whose two abilities are actually equal.
A question mark denotes an entry that is illegible in this copy.


how large a difference must be to allow this degree of confidence. For
DAT-VR and -NA, the average reliability is near .90. The table tells us that
a difference between these scores must be at least 9 points to be significant.
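
The two calculations in this section, the reliability of a difference and the size a difference must reach before it is interpreted, can be sketched as follows. The function names are ours, and the values used are round illustrative figures rather than entries from Table 34. The standard error of a difference between two standard scores is taken as s.d. times the square root of (2 - r_AA - r_BB), the usual expression; that formula is an assumption supplied here, not one printed in the text.

```python
import math


def reliability_of_difference(r_aa: float, r_bb: float, r_ab: float) -> float:
    """Reliability of the difference A - B between two standard scores.

    r_aa, r_bb: reliabilities of the two tests; r_ab: their intercorrelation.
    """
    if r_ab >= 1.0:
        raise ValueError("intercorrelation must be below 1")
    return (r_aa + r_bb - 2 * r_ab) / (2 - 2 * r_ab)


def min_interpretable_difference(r_aa: float, r_bb: float,
                                 sd: float = 10.0, multiple: float = 2.0) -> float:
    """Difference (in score units) equal to `multiple` standard errors.

    sd = 10 corresponds to T-scores; multiple = 2 follows the rule of thumb
    that a difference twice its standard error is worth interpreting.
    """
    se_difference = sd * math.sqrt(2 - r_aa - r_bb)
    return multiple * se_difference


# Illustrative figures: two tests each with reliability .90.
print(reliability_of_difference(0.90, 0.90, 0.60))   # moderate overlap
print(reliability_of_difference(0.90, 0.90, 0.80))   # heavy overlap
print(round(min_interpretable_difference(0.90, 0.90), 1))
```

With reliabilities of .90 the difference is itself fairly reliable (.75) when the tests correlate .60, but its reliability falls to .50 when they correlate .80. In either case the difference must reach roughly 9 T-score points before it deserves attention, since that threshold depends only on the two reliabilities.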


A difference smaller than that indicated by the table should be regarded only
[...] by other data. If two tests are highly correlated, [...] profile then is
not very useful for differential measurement.
Test developers are giving increasing thought to ways of reporting scores
so that their unreliability will be kept in mind. One device is the report
for an educational achievement test shown in Figure 53. Here the [...]
score is shown, not as a point on the scale, but as a range within which [...]
ability almost certainly falls. The width of the band is twice the standard


FIG. 53. Profile for the Sequential Tests of Educational Progress. The shaded
areas for Mathematics and Social Studies overlap; there is no important
difference in standings on these two tests. The same is true of Mathematics and
Science. However, the shaded areas for Science and Social Studies do not overlap.
The student is higher in Social Studies than in Science ability, as measured by
these tests. (Copyright 1958, Cooperative Test Division, Educational Testing
Service, and reproduced by permission.)


error of measurement. The student can see from the profile that about [...]
persons in ten surpass his mathematics score, and that the difference
between social studies and mathematics is not reliable.
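
The band method of reporting can be imitated with a short calculation. In this sketch each band extends one standard error of measurement on either side of the obtained score, so that its width is twice the standard error. The scores, the s.d. of 10, and the reliability of .91 are hypothetical values chosen for illustration, and the function names are ours.

```python
import math
from typing import Tuple

Band = Tuple[float, float]


def score_band(score: float, sd: float, reliability: float) -> Band:
    """Band of one standard error of measurement on either side of the score."""
    sem = sd * math.sqrt(1 - reliability)
    return (score - sem, score + sem)


def bands_overlap(a: Band, b: Band) -> bool:
    """True when two bands share any part of the scale."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical standard scores (s.d. 10, reliability .91, so the
# standard error of measurement is about 3 points).
science = score_band(48, sd=10, reliability=0.91)
mathematics = score_band(52, sd=10, reliability=0.91)
social_studies = score_band(56, sd=10, reliability=0.91)

print(bands_overlap(mathematics, social_studies))  # overlap: no clear difference
print(bands_overlap(mathematics, science))         # overlap: no clear difference
print(bands_overlap(science, social_studies))      # no overlap: difference is interpretable
```

The pattern matches the one described for Figure 53: adjacent bands overlap, and only the two extreme scores are separated far enough to be called different.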


21. Calculate the reliability of a difference between Spelling and Sentences in
DAT.

22. Which pair of GATB tests appears to have the least reliable difference?
Compute its reliability.

23. Examine the DAT profile in Figure 16. Which score differences are reliable
enough to interpret?

24. In the PMA tests for ages 5 to 7, the correlation of V with S is .60. The
reliabilities are .77 and .86, respectively. What can you say about the
interpretability of the V-S difference?


Stability of Aptitudes. Vocational [...] success far into the future. [...]
adjustment to schooling is [...] specialized aptitudes emerge?

The DAT is designed for use as [...] Sentences, Numerical, and Mechanical;
and near .60 for Space, Abstract, and Clerical (Bennett et al., 1959, p. 68).
Much of this stability is no doubt due to the general factor. Similar results
were reported for GATB (Table 36).

The real question is the stability of differences within the profile. When
ninth-grade differences were correlated with twelfth-grade differences, the
correlations ranged from .20 (Numerical minus Abstract) to .74 (Mechanical
minus Spelling) (Doppelt and Bennett, 1951). Differences among Clerical,
Mechanical, and the overall level of the verbal-language-numerical tests are
stable enough to be taken seriously. It is doubtful if long-range predictions
can be based on Space scores in Grade 9, or on differences between Verbal,
Numerical, and Abstract scores in that grade.

In view of the inadequacy of present data on the stability of factor scores,
a firm conclusion cannot be reached. The following statement is a "best
guess" as to what more complete research will show. Special ability tests
may have some use for short-term prediction and classification in
elementary school. This is suggested by Reed's finding (1958) that the PMA
spatial score (visual discrimination) correlates .41 with achievement in primary
reading whereas verbal ability correlates only .27. At higher grades, V
correlates .52 and S only .18, a finding which reflects the shift in teaching
emphasis, after basic skills are established, from perception to comprehension.


Special ability tests are often relevant in studying children requiring
remedial help. For most elementary pupils, guidance is best based on general
measures of verbal and nonverbal ability rather than on more elaborate
profiles whose implications for instruction are unknown.

There can be little long-range differential prediction before Grade 11. In
Grades 7-10, aptitude tests suggest strong points so that the pupil will be
encouraged to enroll in courses where these assets will be developed. In these
grades low scores need not be considered seriously save where, as in Number
and Spelling, remedial instruction can raise the score. By mid-adolescence,
the individual's aptitude pattern is reasonably stable. Even at this
age, irreversible decisions should be avoided. Later courses and job
experience will add greatly to the student's knowledge of his aptitudes.
Meanings Attributed to Scores. For counseling, scores must be explained in
common-sense terms. The client will continue to face choices between
courses and between job openings, and the counselor cannot possibly give a
recommendation that will anticipate all such questions. He must help the
client to understand his own profile and to understand what tasks the various
aptitudes are relevant to.

The DAT and GATB profiles are well designed for such interpretations.
Labels like Numerical Reasoning and Spelling do not sound like mysterious
inborn aptitudes; they are clearly measures of a certain type of performance.
The safest way to interpret scores is in terms of the items that constitute the
test; i.e., "This score shows that you do well on problems like this." Any
more elaborate interpretation leads quickly to misunderstanding. Mechanical
reasoning is misinterpreted as "mechanical aptitude" though the test
clearly does not cover dexterity. The clerical test is misinterpreted as a
predictor of success in stenography and typing whereas it actually covers rapid
checking of details, important only in very routine office jobs. The student
may connect spatial ability to art, geometry, and shop courses even though
the validity coefficients discourage such an interpretation.

Some degree of vagueness is absolutely essential. The student should be 
made to feel that he can improve many of his aptitudes. He should regard 
the test findings as hints to be checked in other experience. Nothing in our 
experience with testing justifies making firm individual decisions on the basis 
of differential abilities. 


The case of Sarah Carrell provides an illustration of many of the comments 
we have made (Bennett et al., 1951). 


Early in her junior year, Sarah talked over her test scores with the counselor. 
Her school work had been satisfactory. She then appealed for help in persuading 
her mother that it was worth while to finish high school. The mother wished her 
to go to work since her father had been forced to retire on just a small pension. The
mother felt Sarah was over-age (illness in childhood had retarded her one year), 
and that she would not do well in secretarial training because her school grades 
were not above average. Moreover, none of Sarah's older sisters had graduated 
from high school and the mother considered high school of little value for a girl. 


Sarah's DAT profile showed that she fell in the middle range of high-school
juniors. Her Spelling and Sentences scores were her lowest, at the
25th percentile. Her peaks were Numerical (75th percentile) and Abstract
(70). All other scores were at the median. In Grade 9, a reading test had
placed her at the 58th percentile, and the Otis group mental test in Grade 8
at the 47th percentile. All these agree with the DAT in indicating that Sarah
had enough ability to finish school.

The test record was useful in showing Sarah's mother that the girl was
superior in numerical and abstract performances. The counselor pointed out
that Sarah could expect to do well in calculating and bookkeeping, which she
could take if she stayed in school. (NA, AR, and Sentences are the best
predictors of bookkeeping marks.) "The mother," says the counselor, "then
admitted that her secret desire had been for Sarah to work in an insurance
[...] brother-in-law could secure her a job. She conceded that Sarah [...]
ought to have a chance to finish school."

Sarah is deficient in language usage, and the counselor should point to
[...] office work. If this deficiency is repaired, as may well happen when
study is motivated by a definite goal, Sarah could qualify for almost any
office job at a modest level of responsibility. If this deficiency remains,
the test analysis has shown that her best opportunity for success is in
bookkeeping or the like.

The DAT scores of Robert Finchley (Figure 16) contradict his scores on 
other tests. His Otis score was at the 55th percentile, his reading speed at 
the 24th, and his comprehension at the 50th. But on the DAT, he had these 
percentile scores (Bennett et al., 1951): 


VR NR AR SR MR CSA Sentences Spelling 

95 95 97 92 95 10 14 9 

His parents are college graduates, and his sister has a good school record.
Robert's record had declined steadily during all his school years, and in high
school he was doing little of his assigned work. The DAT had been given
routinely in Grade 10, but no effort was made to discuss it with Robert, or
even with his teachers, until a year later.

The story of the test scores is clear: outstanding overall ability, with a
severe deficiency in clerical speed and in language usage. From the case
history it appears that Robert's teachers had begun to regard him as a mediocre
student who could not be expected to do well, and that he had come to share
their opinion. Robert himself was openly delighted with the test report
and put forth more effort as he regained confidence. He became interested
in obtaining information on schools of engineering. The record suggests a
need for remedial reading, but this could perhaps better be added to
Robert's schedule after he gets his current work in hand.
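
The reading of Robert's profile can be sketched in a few lines of Python. This is only an illustration: the percentile values are those reported above, but the function name `summarize_profile` and the cutoffs of 75 and 25 are assumptions made for the example, not rules taken from the DAT manual.

```python
# Percentile scores for Robert Finchley as reported in the text.
robert = {"VR": 95, "NR": 95, "AR": 97, "SR": 92, "MR": 95,
          "CSA": 10, "Sentences": 14, "Spelling": 9}


def summarize_profile(percentiles, high=75, low=25):
    """Sort the tests of a percentile profile into high, middle, and low groups.

    The cutoffs are illustrative assumptions, not published conventions.
    """
    summary = {"high": [], "middle": [], "low": []}
    for test, pct in percentiles.items():
        if pct >= high:
            summary["high"].append(test)
        elif pct <= low:
            summary["low"].append(test)
        else:
            summary["middle"].append(test)
    return summary


profile = summarize_profile(robert)
print("High:", profile["high"])
print("Low:", profile["low"])
```

Such a grouping only restates the profile; as the text stresses, it is a hint to be checked against other evidence rather than a basis for a firm decision.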


25. Would the GATB have given valuable information to supplement Sarah's DAT
scores?
26. Why did the Otis test fail to reveal Robert's superiority?
27. Is engineering the most suitable goal for Robert?
28. Interpret the profile of Ellsworth Newcomb. He has been preparing for
engineering, but is making C's in mathematics. His tested interests are in
verbal and personal-contact activities. He has done some selling, with success.
On the OSU test, he scores at the 69th percentile of college freshmen. His DAT
percentile scores in grade 12 are:

Verbal  Numerical  Abstract  Space  Mechanical  Clerical  [...]  [...]

86      48         44        40     36          13        73     93


IMPORTANT GUIDANCE BATTERIES 

There is more to be said about interpreting tests so as to maximize the
subject's insight, but before expanding on this point we provide a summary
listing of some differential batteries now available to the tester.
Comprehensive information on these batteries has been compiled and reviewed by
Super (1958).

• Differential Aptitude Tests; George K. Bennett, Harold G. Seashore, and
Alexander G. Wesman; Psychological Corporation, 1947, 1952. [...] college.
A well-constructed set of eight tests described on pp. [...].

• Flanagan Aptitude Classification Tests; John C. Flanagan; Science
Research Associates, 1958. High-school seniors. A seven-hour battery of [...]
tests suggested by Air Force factor analyses. In addition to the [...]
aptitudes there are tests for ingenuity, tapping, speed of scale [...] skill,
etc. The validity of the tests is still under investigation, and they
should be restricted to research use at present. In particular, the
"occupational scores" obtained by combining tests should not be used until
satisfactory evidence of their validity is provided.

• Guilford-Zimmerman Aptitude Survey; J. P. Guilford and Wayne S.
Zimmerman; Sheridan Supply Company, 1947. Measures V, R, N, [...]
Vz, Mechanical knowledge. Based on factors found useful in [...]
classification. Contains several unique tests which may have practical
value, but evidence on predictive validity in civilian tasks is not available.
Primarily for research use at present.

• Holzinger-Crowder Uni-Factor Tests; Karl J. Holzinger and Norman A.
Crowder; World Book, 1955. Grades 7-12. An excellently constructed battery
which in one hour gives measures of V, S, N, R, and a composite [...]
of scholastic aptitude. Reliabilities range from .80 to .90, with unusually low
intercorrelations. Speed is of some importance in the short subtests. Predicts
overall grade average very well, but the value of the factor scores for
differential prediction appears to be quite limited.

• Multiple Aptitude Tests; David Segel and Evelyn Raskin; California
Test Bureau, 1955. Grades 7-12. Nine tests in three hours cover vocabulary,
reading, language usage, clerical, arithmetic, mechanical comprehension,
and spatial abilities. Test scores may be combined into [...]
correlated scores for V, P, N, and S. The battery is technically satisfactory
and uses tests of familiar types which can be interpreted by [...]
counselors. Differential validity for course grades is not great, and
occupational validities are not established.

• Tests of Primary Mental Abilities; [...] Science Research Associates [...]
several of Thurstone's [...] lower grades [...] prediction. The one-hour
battery for ages 11-17 measures vocabulary, computation, fluency, space, and
reasoning. Until evidence to indicate the meaning of profiles is available,
the tests should be confined to research use. Incautious and incorrect claims
have been made for PMA tests (Anastasi, 1954, pp. 114, 365-368; Super, 1958,
p. 87).


HELPING CLIENTS USE TEST INFORMATION


Client-Centered Counseling 


In earlier days of psychological service, the counselor was often viewed
as an expert passing judgment, in the same category as an engineer inspecting
a bridge or a physician prescribing for a disease. The modern view is
that the counselor does not decide or direct, but rather helps the client think
for himself. In extremely directive or prescriptive counseling, the "expert"
obtains facts, decides, and tells the client what to do. So-called
"client-centered" counseling stresses the importance of the client's making his
own decisions. This point of view, formulated by Rogers (1942), emphasizes that
the important goal is the growth of the client toward maturity and adjustment.
A person who has learned to rely on his own judgment has been
helped more than one who must seek advice in each new crisis.

Expert advice often fails because factual questions are entangled in
emotional attitudes. The true problem is often not the surface problem voiced to
the counselor. Suppose Stan Howard, employed on a finishing machine,
comes to inquire why he was not promoted to foreman. The directive
personnel manager might give the facts, based on tests and ratings, which
"prove" that he would make a poor foreman. He may even give Howard a
pep talk on how well he produces, about the chance of raising his pay as a
workman, and about the undesirability of seeking a job where he would fail.
Howard is likely to nod his head and leave, but he may be far from convinced
he should not be a foreman. He may quit and go to another company
where he'll "have a chance." Howard may have failed to state or even to
recognize that he is anxious to be a foreman because his brother-in-law is a
foreman and he wishes equal status. Similar "irrelevant, nonobjective" factors
may lurk within the case of the student who studies inadequately,
the airman who longs to be a pilot, the mother who overrates her child's
ability, or the unpopular girl. The client seeking counseling phrases his problem
to protect the tender spots of his ego. The counselor who relieves a surface
problem may be helping the client to avoid facing his real conflicts.

The nondirective methods suggested by Rogers help the client express his
feelings. The counselor reflects the client's feelings by rephrasing what the
client has said: "You think you'd rather be a foreman than a machine
operator"; "It's discouraging when a man who came after you did is promoted
over you"; "You feel that the management doesn't trust you." Acknowledging
the feelings, instead of trying to prove them false, promotes ultimate
adjustment. The client, freed from need to justify or apologize for his
attitudes, gains insight into himself.


The client is made responsible. He asks the questions, limits the area
discussed, makes the judgments, and decides when to terminate the counseling.
If the counselor proposes a test, or suggests that poor [...] may be a
source of difficulty, or lays down alternative solutions, he is taking
responsibility. He thereby risks pushing the client faster than he is ready to
move.

Tests designed to help the tester be wise become of secondary importance
in client-centered counseling because they do not center on the feelings of
the client. Says Rogers (1946):


The counseling process is furthered if the counselor drops all effort to
evaluate and diagnose and concentrates solely on creating the psychological
setting in which the client feels he is deeply understood and free
to be himself. It is unimportant that the counselor know about the
client. It is highly important that the client be able to learn himself.
(Not to learn about himself, but to learn and accept his own self.) In
making use of these principles the counselor examines his own attitudes
and techniques and endeavors to refine his procedures so as to eliminate
all which are not in accord with the basic principles. Thus questions
are eliminated from the interview because they invariably direct
the conversation, advice is eliminated because it assumes the counselor
to be the responsible person, diagnosis and evaluation are put aside
because it has been learned that even when they are not voiced they tend
to distort the counselor's responses in subtle ways and to break down
his full acceptance of client attitudes.


[...] counseling, tests enter only [...] comes with the st[...]
[...] conflict that set him to wondering about his intelligence. Perhaps he is
worrying about changing his major; perhaps he is concerned because his grades in
college are lower than in high school. The problem may be as remote from that
stated as a worry because his wife's family considers his pronunciation
peculiar. The counselor who avoids bringing the conference to a head, giving
an answer, and terminating the interview permits the client to dig into what
really concerns him.

Bordin and Bixler (1946) suggest that counselors place on the client the
responsibility of choosing the tests to be taken. In contrast to establishing a
standard battery of tests to be taken, they invite questions about tests and
discuss at length the sorts of tests available. They neither recommend a
particular test nor limit their description to the tests the client asks about.
After hearing what tests can be had, the client takes the initiative in deciding
among them. This is particularly helpful in erasing the idea, common among
those who seek counseling, that one or two tests will give definite answers
to every problem.
Decisions made by the counselor apparently have less effect on most
clients than those they make themselves. The counselor helps the client most
when he helps him to reason out his own decision. Bixler and Bixler (1946)
have made numerous suggestions to increase the client's involvement in test
interpretation and his self-examination.

The counselor avoids giving opinions. The counselor is always tempted to
comment on the goodness or badness of scores to build confidence or
emphasize the seriousness of symptoms. Such evaluation comes between the
client and the score and makes it harder for him to accept the score as a
reality. Bixler suggests prediction in the form of an expectancy instead.

Bixler's second suggestion is that the counselor should be frank. Low scores
must be faced honestly, if the client is to gain in self-knowledge. A test score
inconsistent with the person's previous impression of himself forces him to
take a new look at his plans. Students characteristically overestimate their
ability and interest in the vocational field they have chosen. Test results
which challenge these distortions can be beneficial, but they obviously
generate emotional conflict which the tester must turn counselor to dispel.

What is less obvious is that favorable test results are equally likely to pose
problems for the subject. Bordin (1951) tells of the college student who
earned a high score on a "scientific aptitude" test because the test included
achievement items and he had taken considerable science in high school.
Although the student "had made a definite choice of business administration,
he was thrown into a state of indecision by this test result, partly because his
father was a successful engineer. Later counseling proved that his original
choice was well founded and that his indecision would have been short
lived if the tests had been properly interpreted to him by someone who
could also have helped him to relate these results to his percept of himself
[...] from his father."

[...] does not argue against giving information to the [...]
opportunity for him to find out about himself, and it is [...] correct
self-image than to leave him with false impressions. [...] must decide what
information the person is able to assimilate. [...] advantage of achievement
tests such as the SCAT in college [...], as distinguished from tests which
appear to measure intelligence, is [...] usually finds it easier to accept
unfavorable evidence about his achievement than evidence of "low intelligence."

The client must always feel free to reject any interpretation. [...]
able to say that, though his score is low, he expects to [...]
able to reject his own interest test score by insisting that he [...]
engineering despite a low interest in computation. It is only [...]
that he need not argue with the counselor that he becomes [...]
himself nondefensively. The counselor should help the client [...]
emotional reactions to the test scores. Emotional reactions block [...]
thinking; the client can use the scores wisely only after he has come to an
understanding of his emotions.

These points are illustrated in the following dialog from a case record
(Bixler and Bixler, 1946):
Counselor. Sixty out of one hundred students with scores like yours succeed in
engineering. About eighty out of one hundred succeed in the social sciences.
The difference is due to the fact that study shows the college aptitude test
[...] important in Social Sciences, along with high school work, instead of
mathematics.
Student. But I want to go into engineering. I think I'd be happier there. Isn't
that important too?
C. You are disappointed with the way the test came out, but you wonder if your
liking engineering better isn't pretty important?
S. Yes, but the tests say I would do better in sociology or something like
that. (Disgusted)
C. That disappoints you, because it's the sort of thing you don't like.
S. Yes. I took an interest test, didn't I? What about it?
C. You wonder if [...] with the way you feel. The test shows that most people
with your interests enjoy engineering and are not likely to enjoy social
sciences ...
S. (Interrupts) But the chances are against me in engineering, aren't they?
C. It [...] all ... Being scared makes [...]


29. At what age is it appropriate for counselors or school psychologists to give
a child or adolescent information about his abilities?
30. Reread the counselor's remarks carefully. Did he at any time suggest what he
thought was right, or what he approved? Did he disapprove of any idea of
the client?
31. In the interview quoted, would it be helpful or harmful for the counselor to
make these remarks?
a. It's probably better for you to work in an area you like than to follow these
tests strictly.
b. Most people develop an interest in areas where they do well; you probably
would learn to like social science if you tried it.
c. If you stay in engineering, you should plan to take a course in remedial
mathematics.
d. It seems to you that it's wisest to work in the field where your chances are
best.
32. Which is more likely to be threatening, a report on a general scholastic
aptitude test or a report on a battery like the DAT?


Fact-Centered Counseling 


Although emphasis has been placed on nondirective counseling above, it
should not be assumed that [...]. [...] widely used under many circumstances.
Some counselors prefer them. Administrative requirements often force a
counselor [...] responsibility for decisions, as when a veterans' counselor is
required by [...] vocational plans of certain trainees. When a case is referred
for counseling [...] than coming in voluntarily, the counselor cannot stick to
client-centered methods. Cases in which the client is incapable of
self-direction must also be prescribed for.

Those using tests prescriptively emphasize the importance of "objective
facts" as a basis for rational decision, in contrast to Rogers' emphasis on the
emotional meaning of the facts. The [...] counselor tends to think of
the client as leaning on someone for [...] tests an especially sound basis for
giving the direction [...] the problem of counseling is to convince the client
that his plan [...] tests are regarded as a forceful type of [...] (Staff [...]
Guidance Service, 1946). The counselor who wishes [...] his client to
face the facts takes a stand similar to John Dewey [...] passage dealing with
children (1938, pp. [...]):

The suggestion upon which clients act must in any case come from
somewhere. It is impossible to understand why a suggestion from one
who has a larger experience and wider horizon should not be at least as
valid as a suggestion arising from some more or less accidental source.
It is possible of course to abuse the office, and to force the activity
of the young into channels which express the purpose of the counselor rather
than that of the client. But the way to avoid this is not for the counselor
to withdraw entirely. . . . The counselor's suggestion is not a mold for
a cast-iron result but is a starting point to be developed into a plan
through contributions from the experience of all engaged in the
counseling process.


Prescriptive counselors generally obtain a variety of information, make an
interpretation, and bring the client to act on this information. While they
respect the right of the client to choose between alternatives of merit and do
not force even a wise course of action upon him, their emphasis is on keeping
the client from making errors. Williamson (1939, pp. 134-138) puts the
position this way:


The effective counselor is one who induces the student to want to
utilize his assets in ways which will yield success and satisfaction. . . .
Ordinarily the counselor states his point of view with definiteness,
attempting through exposition to enlighten the student. . . . In respect
to no student's problem does the counselor appear indecisive to the extent
of permitting loss of confidence in the authority of his information.
. . . If it is true that the counselor should not make the student's
decision, it is equally true that someone must render this very service until
some students are able, intellectually and emotionally, to think for
themselves.


In helping the client make decisions, the counselor, whatever his
technique, wishes the client to have a basis for optimism. The nondirective
counselor would prefer that this come through insight, whereas the directive
counselor tends to give direct encouragement. In either case, however, the
client should leave the counseling with a positive plan for action, rather
than merely with the knowledge that his former plan was inadequate.
Similarly, he must have a feeling that he has some strong qualities, rather than a
total feeling of failure because tests have brought to light only weaknesses.
In every test performance, there are some praiseworthy aspects. The counselor
[...] to give support will call attention to such features [...], or
persistence, in addition to giving the client facts [...]. Nearly all
counselors working with normal late adolescents and adults agree in giving the
client the facts on which recommendations (if any) are based. The counselor
[...] form sets up a fear in the client that he was not told because his scores
were too poor.
The most helpful single principle in all testing is that test scores are
merely data on which to base further study. They must be coördinated with
background facts, and they must be verified by constant comparison with
other available data. This is the reason that continued counseling by an
adviser over a year is more effective than "one shot" counseling where an
answer is given to each new specific problem by a different adviser. The test
score helps the counselor by warning him to look in the record for further
symptoms of a particular problem. The score, and study of items within the
tests, suggest topics to probe by interview methods. While sometimes it is
necessary to act on a problem immediately, it is sound practice to defer a
final decision as long as possible, meanwhile seeking confirmation of
tentative diagnoses.
33. Discuss the advisability of delaying final decision in each of these situations.
What supplementary information should be sought to confirm the tentative
conclusions?
a. A college student who is failing in engineering at midterm seeks a more
suitable vocational goal. Aptitude and interest tests suggest journalism.
b. An engaged couple, after a quarrel, seeks the help of a marital counselor.
A personality test intended to predict marital adjustment (validity .50)
shows that their score as a pair is low, in the range where there is an even
chance of divorce.
c. Students applying to enter a graduate school for social work are tested
routinely. A girl shows severe neurotic signs on both a questionnaire and a
subtle, moderately dependable personality test.


Suggested Readings 


Bennett, George K., & others. Counseling from profiles. New York: Psychological
Corporation, 1951.
This booklet presents a general discussion of the DAT and a philosophy of
counseling, then discusses thirty cases showing a variety of realistic problems
where aptitude profiles are useful.

Bordin, Edward S. Test selection and interpretation and Illustrations and
problems. Psychological counseling. New York: Appleton-Century-Crofts, 1955.
Pp. 262-331.
Bordin amplifies his view that tests imposed on the client without adequate
preparation may delay improvement, and shows by extracts from interviews
how skilled counselors deal with such problems as [...]
to make decisions for him, and the client who has been forced into counseling.

Lamke, Tom A., & Nelson, M. J. Single-score tests vs. factor-score tests.
Examiner's manual, the Henmon-Nelson Tests of Mental Ability. Boston: Houghton
Mifflin, 1957. Pp. 19-99.
The Henmon-Nelson test series yields a single measure of general ability.
When it was revised, this section was added to the manual to explain why the
authors had not shifted to the multiscore pattern. The authors' view that
differential testing has little or no advantage over single-score testing should
be compared to the views expressed in the Super reference, below.

Super, Donald E. (ed.). The use of multifactor tests in guidance. Washington:
American Personnel and Guidance Association, 1958. (Also published in The
Personnel and Guidance Journal, 1957.)
In an unusual symposium each prominent differential battery is described by
its authors. These articles combine factual information with a certain amount
of "sales talk." Following each presentation, Super gives a short but pointed
critique of the test and the validation research on it. Super's introductory
paper (pp. 2-8) is a strong argument for differential testing and should be
compared to the Lamke-Nelson reference above.




Other Special Abilities 


THE tests discussed in the preceding chapter are the ones most often used
in guidance. The present chapter describes other tests of special abilities,
including those for psychomotor and artistic aptitudes.


PSYCHOMOTOR ABILITIES 


The only psychomotor performances considered to this point are the simple
speed and dexterity measures of the GATB. Many tests using more elaborate
apparatus and measuring more complex abilities have been tried, and
many have shown predictive value. Since the tests are costly to construct,
maintain, and administer, their use is largely confined to industrial and
military classification.

The costliness of psychomotor testing, combined with the difficulties of
obtaining adequate criteria of occupational success, has discouraged research
on motor abilities. Our knowledge rests almost entirely on a few research
programs, of which by far the most significant has been that of the Air Force,
which has large samples of men, excellent equipment and control of testing
conditions, and superior criterion data (Melton, 1947; Fleishman, 1956).

All psychomotor tasks involve intellectual abilities such as are found in
pencil-paper tests. Many apparatus tests are correlated with factors P, S, and
Mechanical Experience, as well as with strictly psychomotor factors. We
shall concentrate here on the uniquely motor abilities and the tests which
measure them. We shall describe a number of illustrative tests before turning
to a factor-analytic classification of motor abilities.


Simple Performance Measures


Reaction Time. Measurement of reaction time goes back to the earliest days
of experimental psychology. The techniques used today differ only in
elegance of instrumentation from some of the procedures Wundt and Cattell
introduced in the first psychological laboratory at Leipzig. The subject is
told to react to a light or other signal as quickly as he can. When he presses
the response button, an electrical timer records the interval that elapsed
between signal and response.

Modern apparatus can present a whole series of stimuli, record times, and
cumulate the score, all automatically. The signal apparatus is "programmed"
by a tape or a cam so as to present signals at irregular intervals. Such
automation is important for tests involving complicated stimulus patterns, as in
measures of discriminative reaction time, because it speeds up testing and
reduces the variation in testing procedure.

Although it has often been thought that simple reaction time is relevant to
automobile driving and to many jobs, consistent evidence to support this
view is lacking. Simple reaction is a different matter entirely from reaction
with judgment. A test of discriminative reaction time, where a different
button must be pushed for each pattern of light signals, correlates only about
.30 with simple reaction time (Melton, 1947, p. 102). Most practical
performances probably depend more on choice reaction than on simple reaction.
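
The automated procedure described above (signals at irregular intervals, with times recorded and cumulated into a score) can be sketched in software. The sketch is hypothetical throughout: the function names, the one-to-four-second gap range, and the scoring by total and mean time are assumptions for illustration, not a description of any actual apparatus.

```python
import random


def make_schedule(n_signals, min_gap=1.0, max_gap=4.0, seed=0):
    """Return signal onset times (seconds) separated by irregular gaps.

    The gap limits are arbitrary illustrative values.
    """
    rng = random.Random(seed)
    onsets, t = [], 0.0
    for _ in range(n_signals):
        t += rng.uniform(min_gap, max_gap)  # irregular interval before each signal
        onsets.append(t)
    return onsets


def cumulate_score(onsets, responses):
    """Return (total, mean) reaction time from paired onset and response times."""
    times = [r - s for s, r in zip(onsets, responses)]
    if any(rt < 0 for rt in times):
        raise ValueError("a response precedes its signal")
    total = sum(times)
    return total, total / len(times)


onsets = make_schedule(5)
# Pretend the subject answered each signal a quarter of a second after onset.
responses = [t + 0.25 for t in onsets]
total, mean = cumulate_score(onsets, responses)
print(f"mean reaction time: {mean:.2f} s")
```

Fixing the seed makes the irregular schedule identical for every subject, which corresponds to the reduced variation in testing procedure that the text attributes to automation.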

Steadiness and Simple Controlled Movement. Steadiness is required where
one must maintain a fixed posture or must trace a pattern accurately. [...]
a small aperture without touching its sides. [...] stylus and base plate are
connected electrically, and each contact is registered on a counter.

So-called "aiming" tests involve quick, precise eye-hand coördinations.
Aiming may also be measured by a stylus-and-hole apparatus. The subject
is required to thrust the stylus into successively smaller holes without
touching the sides, or into holes momentarily uncovered by a rotating shutter. A
pencil-paper version of this test requires the subject to place dots in small
circles as fast as he can; this test involves motor speed as well as precision
of movement.

Tests of aiming and steadiness had negligible validity for selection of pilots
and bombardiers. Arm-hand steadiness is related to success of aircraft
electricians, according to one study. Several studies have found very high
correlations between aiming or steadiness tests and rifle marksmanship
(Humphreys et al., 1936).
1. Decide which type of steadiness test would be most promising for selecting
persons for each of the tasks listed below. If none of the tests mentioned above
seems fully suitable, attempt to describe one more comparable to the job.
a. A jigsaw operator is to move a board, about eight inches square, so that a
curved pattern is cut out.
b. A rifleman must hold his sights steadily on a target while resting on an elbow
in a prone position.
c. A pistol marksman must hold his sights steadily on a target while standing.
d. An engraver must follow a pattern with great precision, using a small power
tool.

Speed and Dexterity. We have already encountered speed of movement in
the USES tests, where it enters into scores K, F, and M. The nearest to a pure
measure of movement is the Mark Making test (p. 274). In Table 39 we
noted that the manual factor M, involving speed and dexterity, correlated
.30-.55 with success in many jobs, having a notably high correlation with
success of persons mounting wires in radio tubes. Factor K has equally large
correlations with such jobs as typing, telephone operating, packing, and
outboard motor assembling. In general, motor speed is important in
overlearned routine tasks.

The manual and finger dexterity tests of the USES require simple rapid
movements. Some other tests require more complicated movements, for
example, inserting pins into narrow holes with tweezers, or threading nuts
on bolts. Low-to-moderate positive correlations are reported for dexterity
tests as predictors of office and factory jobs.

c 3s 

9mplex Codrdinations 
described above, one can ask the sub- 
ere is little rationale to guide in the de- 
deal of apparatus testing has been 


ours of the fairly simple tasks 

» to do quite complicated acts. Th 
780 of these complex tasks, and a good 
ased merely on hunches. 


One principle that has often worked well is that of the job replica. If we
are selecting workers to perform a particular job, we might use a worksample,
i.e., we might observe them briefly on the job itself and measure
their output. If the job requires training after selection or uses expensive
apparatus, however, the true worksample may be impractical. In such a case
the tester tries to design an apparatus which reproduces much of the original
task, without requiring skills that have to be developed during job training.

An excellent example of the job replica is the Complex Coördination
test of the Air Force (Figure 2). One cannot observe a would-be pilot in an
airplane, but the Complex Coördination test gives him a stick and rudder
bar which he is to move much as the pilot does. Movements are dictated by
signal lights. When a light appears at the top of the left center column, the
man pulls the stick so that the right center light will move upward to match
it. A sideways movement of the stick controls the light in the top row, and
the rudder controls the light running across the bottom row.


This test had a validity of about .40 for predicting pilot success and was
given the highest weight among all tests used in the selection battery. A
factor analysis demonstrated the reason for this high validity: the Complex
Coördination test duplicates better than any other test the common-factor
composition of the student pilot's task. Table 41 (cf. Figure 44) gives
loadings of the test and of the criterion (graduation from pilot training).
The sum of the products of the loadings agrees almost exactly with the
validity of about .40. The validity is accounted for by the common factors,
which means that the specific content of the Complex Coördination test does
not contribute to prediction of pilot success. Note that the spatial aspect of
the test accounts for more of its predictive value than does the coördination
factor.


TABLE 41. Factor Loadings of the Complex Coördination Test and the
Pilot Success Criterion

                             Complex Co-    Graduation-    Product
Factor                       ordination     Elimination    of
                             Test           Criterion      Loadings
Spatial                      [illegible]    [illegible]    .167
Psychomotor coördination     [illegible]    .34            .075
Mechanical experience        .20            .22            .052
Interest in piloting         [illegible]    .26            .048
Visualization                .17            .28            .042
Perceptual speed             .17            .25            .026
Numerical                    .17            [illegible]    [illegible]
Verbal                       .09            .01            .00
Reasoning                    -.01           -.02           [illegible]
Uninterpreted factor         .02            -.02           -.00

Source: Melton, 1947, p. 995.
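
The arithmetic behind Table 41 can be checked with a short sketch. In a factor
analysis with uncorrelated factors, the correlation between two measures equals
the sum of the products of their loadings on the common factors. The list below
holds only the six entries of the "Product of Loadings" column that are legible
in this copy; the remaining entries are at or near zero. The variable and
function names are ours, chosen for illustration.

```python
# Legible entries of the "Product of Loadings" column of Table 41.
# The other rows are near zero or unreadable and are omitted here.
products = [0.167, 0.075, 0.052, 0.048, 0.042, 0.026]


def implied_validity(loading_products):
    """Correlation between test and criterion implied by the common factors.

    Assumes uncorrelated (orthogonal) factors, so that the correlation is
    simply the sum of the products of the two sets of loadings.
    """
    return sum(loading_products)


total = implied_validity(products)
largest_share = max(products) / total

print(f"validity implied by common factors: {total:.2f}")
print(f"share contributed by the largest product: {largest_share:.0%}")
```

The sum comes to about .41, close to the reported validity of about .40, and
the single largest product contributes roughly two-fifths of it.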

FIG. 55. Four pursuit or coördination tasks (Fleishman, 1956). Panels shown
include Rotary Pursuit, Rudder Control, and Two-Hand Coördination. Except for
Rudder Control, which is large enough for the subject to sit in, the tests
are of desk-top size.

Another type of complex task calls for “pursuit” or “tracking,” i.e.,
following an irregular course or a moving target as in gunnery, radar operation,
and high-speed maneuvering. The essence of a pursuit test is a moving target
which must be followed with a pointer of some kind. Four pursuit devices
are shown in Figure 55. The Rotary Pursuit Test is the simplest and the
oldest. It has been used as a predictive device and as a laboratory instrument
for the study of skill learning. A 3-inch brass disk is set in a bakelite
turntable. The subject uses a stylus with a hinged handle to follow the disk,
his total contact time being recorded electrically. Many variations are
possible. In the Pursuit Confusion Test, the speed of the target changes, and
the subject has to guide his tracking by watching in a mirror rather than by
viewing the target directly. The Two-Hand Coördination Test involves
slower but more complex movement. One handle controls left-right motion
of the follower arm and the other controls front-to-back motion. Both must
be moved at the same time, at different speeds, to stay on the target. The
Rudder Control Test is another job replica, which has the honor of being the
only psychomotor test invented during World War II which proved valua-

Manual dexterity. Skillful, controlled movements in manipulating large
objects with whole hand. The Turn test is one of the better measures of
this factor.
Postural discrimination. Making precise bodily adjustments on the basis
of postural cues. Walking a rail blindfolded would probably be a good
test. The experimental measure of this factor is a test where the man
is seated blindfold in a tilted chair and must push buttons to bring the
chair upright.
Fine [illegible] coördination. Also called “fine control sensitivity.”
Delicate, highly controlled adjustments involving large-muscle groups, as
in Rotary Pursuit and Pursuit Confusion.
Multiple limb coördination. Using two arms, or arms and legs, in a
simultaneous control movement such as clutching-and-shifting an automobile
transmission. This is measured in Complex Coördination and Rudder
Control.
Rate control. Involves continuous anticipations and adjustments of timing
in tracking a target with variable speed and path.
Response orientation. Choosing the proper response among several
alternatives. This has been tested by complex discrimination tasks where
each signal pattern calls for movement in a different direction. Can be
measured by pencil-paper tests as well as by apparatus.
Response integration. Combination of information into a single integrated
motor response. Two-Hand Coördination and Complex Coördination involve
this ability.


In addition to this main list, a number of sheer physical factors such as
“strength” are found, and a number of lesser psychomotor factors which are
not yet well established.

Psychomotor tests which involve different factors often have very low
correlations. (For example, Rate of Manipulation (Placing) correlates only
.02 with Rudder Control.) This definitely rules out the idea of a general
psychomotor ability which makes some people good at any manual or athletic
task.


3. Which factor involves complex movement of small-muscle groups, with little
emphasis on speed?
4. Which factors do you think are involved in each of these tasks?
   a. Riding a bicycle.
   b. Typewriting.
   c. Cutting dress materials, following a pattern.
5. These items are included in the MacQuarrie Test of Mechanical Ability. All are
given with short time limits. Which factors does each seem to measure?
   a. Dotting; 3/8-inch circles, irregularly spaced. Place one dot in each circle.
   b. Tracing; a series of 1-inch vertical lines, each with a 1/8-inch opening
      somewhere along its length. Trace a path through the openings.
   c. Tapping; 3/4-inch circles regularly spaced. Put three dots in each circle.

6. The Purdue Pegboard requires the subject to place small pegs into holes, first 
with his right hand, then with his left hand. Loadings for the right-hand and left- 
hand scores, respectively, on various factors were as follows: reaction time, .25, 
.02; arm-hand steadiness, .14, .06; rate of arm movement, .22, .13; finger dex- 
terity, .46, .58. What explanation can you give for the differences observed be- 
tween right- and left-hand scores? 

7. The correlation of visual and auditory reaction time is only .56. How can the
gap between this value and the reliability of .85 best be explained?


GENERAL PROBLEMS OF PSYCHOMOTOR TESTING 


Apparatus Differences 


Test apparatus is supposed to be standardized, especially when results at
one time and place set standards to be used in future selection or guidance.
The Air Force found that even when several pieces of apparatus were made
in the same shop from the same blueprints they were rarely equivalent.
Moreover, each apparatus changed over time as electrical contacts became
dirty, rubber parts became less elastic, and so on. For example, in the
relatively simple Arm-Hand Steadiness Test the mean scores earned on four
different pieces of apparatus were 227, 230, 260, and 291 (Melton, 1947).
These differences are of practical importance, since the standard deviation
of scores is about 120 points. A score which was average on one machine
would be near the 30th percentile on another.
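
The percentile figure can be reproduced with a rough calculation. The sketch
below takes the four machine means and the standard deviation of about 120
from the text, and assumes that scores on each machine are approximately
normal with that common standard deviation. The normality assumption and the
function name are ours, not Melton's.

```python
from statistics import NormalDist

# Mean scores on four copies of the Arm-Hand Steadiness apparatus (Melton, 1947)
machine_means = [227, 230, 260, 291]
SCORE_SD = 120  # approximate standard deviation of scores, from the text


def percentile_on_machine(score, machine_mean, sd=SCORE_SD):
    """Percentile rank of `score` among men tested on one machine.

    Assumes roughly normal scores with a common SD on every machine;
    this is an illustrative assumption, not a finding from the source.
    """
    return 100 * NormalDist(mu=machine_mean, sigma=sd).cdf(score)


# A score that is exactly average on the first machine ...
average_on_first = machine_means[0]
pct_first = percentile_on_machine(average_on_first, machine_means[0])
# ... evaluated against the norms of the machine with the highest mean.
pct_last = percentile_on_machine(average_on_first, machine_means[-1])

print(f"percentile on first machine: {pct_first:.0f}")
print(f"percentile on last machine:  {pct_last:.0f}")
```

The gap of 64 points between the extreme machines is a little more than half
a standard deviation, which is what moves an average score to about the 30th
percentile.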


Pencil-Paper Measures of Motor Performance 


Apparatus tests are virtually out of the question in guidance testing, and
it has rarely been practical to use them in industrial and military selection.
Initial cost is not the critical factor; Melton estimates that [amount
illegible] covered the total cost of apparatus for processing tens of
thousands of Air Force men. The big cost is in time of the persons who must
give the tests. Even with highly efficient arrangements a tester can handle
only four to six subjects at once. So long as apparatus tests predict validly,
they can save enough to more than repay their costs. In the Air Force, every
man who failed pilot training represented a waste of $25,000. It was
economical to use costly tests to detect failure in advance. It is nevertheless
obvious that testers would gladly substitute pencil-paper tests if these would
measure the same aptitudes.

We have seen in the GATB Mark Making test an example of a pencil-paper
psychomotor test. Some psychologists are convinced that with sufficient
ingenuity other important motor abilities can be reduced to group
pencil-paper tests. The evidence on the question is extremely fragmentary.
One Air Force study (Melton, 1947, pp. 1033 ff.) found that when apparatus
tests and pencil-paper tests of motor speed were put in the same battery,
the two groups defined quite separate abilities. Such small efforts as were
made during the war to obtain validities for pencil-paper psychomotor tests
were discouraging. More recently, Fleishman (1954) introduced several
printed psychomotor tests into a battery and found that they were quite
successful in measuring wrist-finger speed and fairly successful in measuring
aiming and steadiness. The pencil-paper tests had little in common with the
more complex coördination and dexterity tests.


Reliability 


As in the case of performance tests of general ability, reliability has been
a source of difficulty in psychomotor tests. It will be recalled that in the
GATB, F and M are the least reliable scores. Reliabilities for apparatus tests
as usually given are in the neighborhood of .70. This level may be satisfactory
when the test is to be combined with several others in an overall prediction,
but it makes the test untrustworthy by itself.

One might reasonably suppose that extending the test period would raise
reliability, and if so, only cost of testing prevents us from boosting reliability
just as we would for a pencil-paper test. The reliability of apparatus tests
does not increase with length in the normal manner, however, because two
successive sections are not “equivalent.” This is best shown with the Rotary
Pursuit Test, where the internal consistency of a ten-trial score is .97, but the
correlation of the first ten with the second ten trials is only .84. For all the
Air Force tests the same thing was found: reliability increases with length,
but more slowly than the Spearman-Brown formula predicts.
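
The comparison with the Spearman-Brown formula can be made concrete. The
formula gives the reliability expected when a test is lengthened k times with
equivalent material: k r / (1 + (k - 1) r). The sketch below applies it to the
two Rotary Pursuit figures quoted above; the function name and the choice to
double the test are ours, for illustration only.

```python
def spearman_brown(r, k):
    """Reliability predicted for a test lengthened k times,
    given reliability r of the original length (equivalent parts assumed)."""
    return k * r / (1 + (k - 1) * r)


internal_consistency_10 = 0.97  # ten-trial score, from the text
first_vs_second_10 = 0.84       # first ten trials vs. second ten, from the text

# If the second block of ten trials were equivalent to the first, the two
# blocks should correlate about .97, and doubling the test should give:
predicted_20 = spearman_brown(internal_consistency_10, 2)

# Starting instead from the correlation actually observed between blocks:
observed_based_20 = spearman_brown(first_vs_second_10, 2)

print(f"20-trial reliability if blocks were equivalent: {predicted_20:.3f}")
print(f"20-trial reliability from observed block r:     {observed_based_20:.3f}")
```

The estimate based on the observed correlation between blocks falls well short
of the one based on internal consistency, which is the sense in which
reliability grows more slowly than the formula predicts.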

The reason is that the test measures different things at different stages of
practice. The first and second ten trials are outwardly similar, but
psychologically they pose different tasks for the subject. To that topic we now
turn.


Changes in Meaning with Stage of Practice 

In an aptitude test it is important to obtain a stable measure, characteristic
of the person over a period of time. Scores are unstable if we apply a
psychomotor test without giving the subject a chance to learn the task in
preliminary trials. This is again an example of the principle that abilities,
while emerging, cannot be accurately measured.

On complicated testing devices, a subject cannot show his full ability until
he has become familiar with the reaction required. Fleishman and
Hempel (1954a; Fleishman, 1957) gave 64 two-minute trials on the Complex
Coördination Test. (Eight minutes is the usual testing time.) The scores,
together with reference tests, were factored. Figure 56 shows that the factor
content of the test depends on the amount of practice. That is, different men
score high at different stages of practice. In the early stages, cognitive
factors such as S and Vz are most important, along with Psychomotor
Coördination. These interpretative factors account for little variance after
subjects become familiar with the task. Psychomotor Coördination, a factor
common to this test and to other motor tests administered, increases in
importance during the first 40 minutes of practice but then drops back. Two
factors grow steadily in prominence: rate of movement and a specific factor
which we shall discuss further in a moment. Evidently the early trials measure
the person's adaptation to a new task, and intellectual factors play a large
part in the variance. At the end, sheer speed has become one of the leading
sources of individual differences.

FIG. 56. Composition of Complex Coördination Test as a function of practice.
(Data from Fleishman and Hempel, 1954a.) Vertical axis: Percentage of
Identified Variance. Horizontal axis: Practice Time (minutes). Regions shown:
Coördination, Specific Factor; Coördination, Group Factor; Motor Speed;
Cognitive; Perceptual. Curves show proportion of variance accounted for, after
removing error and unidentified minor factors from consideration. Curves have
been smoothed, and factors combined as follows: Cognitive includes Spatial,
Visualization, and Mechanical Experience; Motor Speed includes Rate of
Movement and Psychomotor Speed.

The specific factor, found in the Complex Coördination scores but not in
Rotary Pursuit, becomes the largest single source of variance. Presumably it
corresponds to particular coördinations or work methods which the individual
acquires as he practices (Stevens, 1951, pp. 1341-1362). If a man with good
general aptitude happens to fall into a bad habit, his final score may be far
below his potential. Specific bad habits are persistent from trial to trial, as
any athlete knows. They are not just a matter of aptitude; the professional
coach makes a career of recognizing and eliminating such faults from the
performance of talented athletes. Not all specific responses are harmful;
sometimes one stumbles into a fortunate rhythm which gives him a higher score
than his aptitude would have predicted. The specific factor built up only in
the Complex Coördination test is unlikely to be of much predictive value. There
is no reason to think that a man who acquires a certain bad sequence of actions
on this test will acquire a similar habit when learning to fly a plane. In
learning that task, other specific habits will develop.


Stability 


When enough research has been done to provide a basis for practical
interpretations of the factors, measures of these factors may be able to play a
major role in vocational guidance. Before tests can be used to make long-range
predictions, however, research on the stability of psychomotor aptitudes will
be necessary. There is considerable evidence of stability of scores over
periods of a few months, but little is known about stability over several
years. For guidance of adolescents, it is necessary to know at what age
psychomotor abilities begin to stabilize. The only substantial follow-up data
so far available come from a study made in a Texas high school, where the same
pupils were tested with the GATB each year (Table 36). These data, though
limited, suggest strongly that simple psychomotor abilities stabilize at about
the same age as intellectual abilities.


Validity 


Any verbal or intellectual tes 
numerical reasonin 
shop grades, all e 
general ability, Desp 
hands,” there is no s 


hope of finding a good predictor, past experience suggests the job replica as
the best bet. The common-sense rule that a test which resembles the job ought
to predict the job has generally paid off with encouraging validity
coefficients. If intelligently designed, the job replica can demand the same
coördinations, speed, and precision as are called for in the criterion task.
Furthermore, it looks like a reasonable test and therefore appeals to the
subject and to the employer in whose behalf the tests are used. There is little
doubt that job replicas will continue to be used in industrial and military
selection for some time.

Some of the results for the Complex Coördination Test raise a serious
question as to whether the job replica deserves its good reputation. Much of the
validity of that test in pilot selection was due to intellectual abilities which
could be measured in pencil-paper tests. The psychomotor aspect of the test
which predicted the criterion was the coördination factor common to several
apparatus tests which were not replicas of the pilot's job. Rotary Pursuit,
which does not “look like” anything the pilot does, was a better measure of
the coördination factor than Complex Coördination. If there is anything
special about the Complex Coördination Test as a replica of the pilot's job, it
must be in the specific factor, and that contributed nothing to validity. So the
magic of the Complex Coördination Test appears to be that it involves about
the right mixture of common factors. If so, such a mixture could be put
together by adding scores from other tests measuring the same factors.

One of the reasons for wishing to avoid the job-replica principle in test de- 
sign is that it leads to an endless process of inventing and revising tests to 
cover additional jobs or to take changes in job requirements into account. 
Vocational guidance could not possibly be based on such tests, since hun- 
dreds of tests would have to be given to cover the occupational spectrum. 

Fleishman believes that it will be possible to prepare a short battery to
measure the chief psychomotor factors, and to combine scores from this battery
so as to predict the psychomotor component of any job. At present, no
one can say how well the factor scores will predict jobs, and one cannot
guarantee that the list of psychomotor abilities will remain short. The factors
do account for about half of the variance in current tests; therefore
combinations of the factor scores may be able to do much of what the original
tests can do. While psychomotor abilities are often of value in predicting
occupational performance, they are generally of less importance than
intellectual abilities. In the USES studies, there were relatively few
occupations where motor factors were substantially better predictors than the
nonmotor factors. These exceptional occupations fall into two broad categories:
bench work or assembling (cheese wrapper, telephone diaphragm assembler,
paper-pattern folder) and manipulative machine operating (machine clothes
presser, bag sealer). Motor tests have excellent validity for these routine
jobs, but as soon as a job becomes less repetitive, perceptual and intellectual
factors make large contributions. Ghiselli's study (1955) of other published
coefficients supports this conclusion, but adds that motor tests make a
substantial contribution in predicting success of structural workers. Tiffin
(1952, p. 126) comments as follows on the failure of motor tests to predict
skilled work (see also Patterson, 1956):


A consideration of the skills demanded of the industrial tradesman or
skilled machine operator indicates that this employee usually succeeds
or fails in proportion to his training and general mechanical
comprehension, not in proportion to his basic dexterity. This fact does not
mean that successful tradesmen do not need skilled movements, but it does
mean that such muscular coordination as may be needed can be developed
by the majority of tradesmen in training and that it is lack of mechanical
comprehension rather than inability to develop the muscular
aspects of the job, that may prevent them from becoming really proficient
in this line of work. This implies that only in the most repetitive
performance can psychomotor tests alone provide an adequate basis for
selection. Even where psychomotor tests do not predict ultimate
proficiency, they nonetheless may make a valuable contribution by
identifying which persons can master the motor components most rapidly.

8. What function might psychomotor tests play in making school physical
education programs more profitable?

9. Experimenters wish to study the effect of vitamin lack on motor performance.
They plan to test a group, then alter the diet, and test again after some time.
Would it be desirable to offer training on the tests before the first
measurement?

10. Under what circumsta

11. Bennett and Cruikshank (1942

ARTISTIC ABILITIES 


General mental tests measure ability to succeed in courses, but a high score
does not guarantee creativeness even in strictly intellectual work. In pain

much as talent, but it is a fair basis for comparing persons with similar
training. Merely asking the subject to draw or paint a picture, or to submit a
piece of completed work, however, does not make for a standardized comparison.
To standardize the task by requiring everyone to draw from the same model,
on the other hand, leaves no room for creativeness.


FIG. 57. Specimen item from Horn Art Aptitude Inventory, showing stimulus
lines and two drawings based on them (C. A. Horn and L. F. Smith, 1945).

The Horn Art Aptitude test attempts to solve this problem by a job replica
calling for high-level creativeness under very slight constraints. In the
“imagery” section of the test the subject is given several cards, each bearing
a pattern of lines. Around these lines he is to make a drawing, which is
judged by art instructors as to imagination and technical drawing quality.
Using careful scoring directions, competent judges can attain a correlation of
.86 between independent scorings. The other chief section of the test calls for
arrangement of rectangles and other simple figures into balanced compositions.


The test is intended for use with applicants to art school, most of whom
have had previous training. The scores at the beginning of the year correlated
.66 with grades in a special art course for high-school seniors (C. A. Horn
and L. F. Smith, 1945).

[illegible] Ewing also favor the job-replica principle. Their Illinois Art
Ability Test (based on one type of item in an earlier test by Knauber) asks
the subject to draw certain objects (e.g., a table) [illegible]. The drawing
is scored not only for the technical quality of the performance but for the
extent to which the subject has elaborated or [illegible] the object. A table
which shows attractive lines and proportions earns a higher score than a
graceless one. The test requires artistic skill, but it [illegible] creative
effort. Scoring rules have been developed which permit [illegible] to score
the papers objectively; two scorings of the same papers correlate .94.
A validation study on students of architecture shows several interesting
facts. In Table 43 we see that the test predicts art courses moderately well

TABLE 43. Prediction of Success in Freshman Architecture Courses


e, 
General Grade ae 
Engineering Freehand Other bined 

Test Drawing Drawing coman a: 
Ilinois Art Ability Test 26 42 r^ 
Object Aperture Test (spatial) Rd 30 E 
Coóperative Mathematics Test 40 27 ^45 
ACE (general ability) 40 25 ‘09 
Bennett TMC -60 10 “49 

Rank in high-school class ` 


Note: The 
he number of 


Sounce: These are unpubl 
and L, J. Cronbach ^ "Pu 


but gives a poor prediction in engineering drawing, which is predicted
very well by the TMC. The average in other courses, including English and
[illegible], is predicted by high-school marks, the ACE tests, and a
mathematics test. This points to a fact of great importance in guidance.
Although a student has special aptitudes, he cannot use them in a profession
unless he has the general ability to carry him through general college courses.
Special aptitudes, in fact, play little part in determining overall freshman
grades, where the drawing courses are outweighed by other courses. The TMC and
the Art Ability Test correlate .12 with the overall average.
Analytic Tests


The job replica gives valuable information about students who have had
some art training, but it neither clarifies the nature of artistic talent nor
gives a basis for comparing untrained persons. For these purposes, it is
necessary to test components of artistic ability prior to training. These
components have not been adequately identified, and the tests now available are
based only on some investigator's hunch as to what makes an artist. One such
test is the Lewerenz Test of Fundamental Abilities in Visual Art (California
Test Bureau). Among the aspects of artistic ability measured are preference
for designs, drawing a sketch to fit a pattern, locating proper positions of
shadows, art vocabulary, reproducing a form (vase) from memory, correcting
perspective, and color matching. Practically no information on the
validity of the test has been published.

FIG. 58. Item from the Meier Test of Art Judgment. The arrangement of the
woman's burden has been changed. Which arrangement is better? (Item copyright
1940, Bureau of Educational Research, State University of Iowa. Reproduced by
permission.)

The most adequate analysis of art ability to date is that of Meier,
who took into account biographies of artists and experimental test results. He
concluded that six traits distinguish the artist: fine eye and hand
coördination, energy and concentration, intelligence, keenness of observation,
creative imagination, and aesthetic judgment. While Meier planned tests for
“creative imagination” and “aesthetic perception,” only a test of artistic
judgment was actually completed.


The Meier Art Judgment Test (and an earlier form by Meier and C. E.
Seashore) has been more widely used than any other art test. The test measures
taste rather than ability to use art media. A “good” work of art is altered
in composition, shading, or some other quality so as to damage its aesthetic
appeal. The original and altered pictures are presented together, and the
subject selects the more pleasing one (Figure 58). If he agrees with the
experts who have taken the test, he gets a high score. The needed validity
studies on the test have not been performed; a few studies give favorable but
very fragmentary evidence.

One difficulty with tests of judgment is that the subject may give, not his
own opinion, but his guess as to what experts will accept as best. An
experiment on the Graves Test of Design Judgment showed that by giving such
guesses the subject could gain several points over the score his own
preferences would earn (Buros, 1953, p. 336).

Artistic judgment is distinct from ability to perform artistically. Rose
Anderson (1951) warns against reliance on the test of judgment as a predictor
in the fine arts. Persons with poor Meier scores often are judged highly
promising by art instructors. Her counseling experience, she says,


has led to considerable caution in encouraging clients toward such
specialization. On the other hand, the combined results of several tests
provide a more adequate basis for appraising potentialities for such
applied fields as advertising art, format, interior decoration, costume
design, and crafts. The appropriate combination of supporting aptitudes
is superior artistic judgment reflected in a high McAdory score,
superior facility for spatial visualization and fine eye-hand coordination,
manual dexterity, evidence of drawing ability reflected in the Lewerenz
Originality of Line Drawing Test [subtest], in the Horn Art Aptitude
Test, or in work samples.

The McAdory test is a test of taste in furniture, clothing, and automobiles as
well as in fine art (Rose Anderson, 1948).

Research on artistic abilities is still in a most primitive stage. No systematic
research has been [illegible] tests and adequate criteria. Nearly all
the tests have been left as they were when first designed, as much as
[illegible] years ago, with little follow-up research or revision. The nature
of artistic aptitude is an unsolved, and neglected, problem.

12. What criterion would the Meier test be expected to predict better than t 
dm ice and vice versa? f the tests 
dcn ids of art aptitude do not appear to be measured by any o 
Y pde traits listed by Meier most likely to reflect talent, training, or €" 
15.
Object Aperture Test but at the 30th percentile in mathematics and general
ability?


SOCIAL INTELLIGENCE 


A recurrent interest of aptitude testers has been to identify qualities which
help one to get along with other people. It was once suggested that there
might be three categories of intelligence: abstract, practical, and social. We
have found some evidence permitting us to distinguish the first two, though
both of them must be further subdivided. After fifty years of intermittent
investigation, however, social intelligence remains undefined and unmeasured.

There are wide individual differences in performance of such roles as
salesman, Army officer, teacher, and psychotherapist. General intelligence
has something to do with success in most of these assignments, but it surely
is not the whole story. Perhaps the difference between the successful and the
unsuccessful performer depends chiefly on personality and interests. Many
testers have tried to identify an intellectual component in ability to respond
successfully to others, and these tests merit brief attention.

In 1926, a Test of Social Intelligence was developed by F. A. Moss and
others (published by George Washington University). There are four subtests
in the revised (1944) version: judgment in common-sense social problems
(e.g., “What is the best way to ask a favor of someone you know only
slightly?”); matching statements with the emotions expressed; everyday
psychological generalizations in true-false form (“In social relations, demands
are usually more effective than requests”); and completion of a joke
(multiple-choice). This test measures general or verbal ability to some degree,
but there is no evidence that it measures any distinct ability which has
practical predictive value. Enough attempts were made to establish the validity
of the test for selection of salesmen, etc., to indicate that this line of
approach is fruitless (R. L. Thorndike and Stein, 1937).
The last few years have seen a large number of tests of “social sensitivity,”
“insight into others,” or “empathy.” Personality theorists have argued that
good personal relations depend upon good communication, and that a good
leader or therapist is he who is sensitive to the ideas and feelings of others,
even the unvoiced ones. We measure A's understanding of B by asking A to
describe some aspect of B and comparing this judgment with an independent
criterion. The most common method, because it is the simplest to apply,
is to have B fill out a personality questionnaire describing himself, and to
have A fill out the questionnaire as he thinks B would. The method is capable
of endless variations, depending on whether A is asked to judge friends,
work associates, or strangers, on what opportunities he is given to observe the
strangers, and on what questions he has to answer.
This is a job replica which encounters all the pitfalls of an approach resting
on surface similarity rather than psychological understanding of the
variable measured. It is a reasonable assumption that a teacher or therapist or
leader ought to understand what those he works with are thinking. The test
design seems like a simple translation into scorable form of an act which he
performs every day. Careful analysis of these tests, however, shows that the
responses do not depend on insight into the individual (Cronbach,
Gage and Cronbach, 1955). No evidence of validity is yet available which
warrants confidence in any present technique for measuring a person's ability
to judge others as individuals.
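
The scoring step of the questionnaire method described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `agreement_score`, the ten true-false items, and the answers are all hypothetical and do not come from any published instrument. As the text cautions, a high agreement count of this kind is not evidence of insight into the individual.

```python
def agreement_score(self_report, predicted):
    """Count the items on which A's prediction matches B's own answer."""
    if len(self_report) != len(predicted):
        raise ValueError("both answer lists must cover the same items")
    return sum(1 for own, guess in zip(self_report, predicted) if own == guess)


# Hypothetical true/false answers to a ten-item questionnaire.
# B describes himself; A answers as he thinks B would.
b_self_report = [True, False, True, True, False, False, True, False, True, True]
a_prediction = [True, True, True, False, False, False, True, False, False, True]

score = agreement_score(b_self_report, a_prediction)
print(score)  # number of items on which the two sets of answers agree
```

A raw count such as this can be inflated when A simply describes a typical person or describes himself, which is one reason a score of this form cannot be taken at face value.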


SPECIAL APTITUDE TESTS FOR COURSES AND PROFESSIONS 


About thirty years ago, numerous attempts were made to develop specialized
aptitude tests for particular school subjects or curricula such as a
foreign language, engineering, or law. The test was usually prepared on the
basis of a superficial analysis of the course of study. Test problems were
based on the type of content to be encountered in the course (e.g., a
language test might involve substituting nonsense symbols for words in a
sentence; a legal aptitude test would ordinarily present hypothetical problems
in legal reasoning).

The tests of this first period have virtually disappeared. The primary reason
is that the introduction of content specially relevant to the course of
study did not raise validity appreciably above that which could be obtained
with a good measure of general ability. When group tests began to provide
separate scores for verbal, quantitative, and later spatial and mechanical
comprehension, these broader-purpose tests appeared to offer all the advantages
of a special test for particular subjects. Prediction ordinarily can rest on
either a general mental test, a verbal test, or a general proficiency test,
though there may occasionally be an advantage in considering special abilities
also.


Psychological study is made of a type of training bor 
iL 
in genera] ability tests. The best example is the MGE 
Carroll and Sapon ( Psychological i gel 
ed to select overseas employees to take pe 
» Showed validities in the range 60-.75. ] to 
ents, predictive validities vary from pl 
course. For four languages in one high m 
© .60; these validities are about .20 higher 1 
for general menta] ability. The test egiela a 
implying that not all very bright pupils are super. 




language learning. The test is administered by means of a tape recording so
as to test the pupil's ear for strange sounds, his ability to learn new material,
and his sensitivity to grammatical forms (Carroll, 1959; Harding, 1958).

Many aptitude tests have been developed in recent years for use in graduate
and professional schools. The tests are for the most part measures of general
ability or academic achievement, such as might be used for general college
counseling. They are adjusted in difficulty to the level of students tested,
and place extra emphasis on the abilities of obvious interest to the profession
in question. This special emphasis may make the test a slightly better predictor
than a general-purpose test.


TABLE 44. Validation Research on the University of California Engineering
Examination

Section    Test    Time    Reliability    Correlation with Freshman Grade Average    Decision Regarding Test
! General 1 Word meaning 15 93 .11 Study further 
scholastic 2 Verbal fluency 10 .63 — Drop 
ability 3 Figure classification 30 73 .12 Revise 
3 bi 
" 4 me vocabu zd 95 a the 
Mathematical 5 Quantitative in- 
reasoning * ference 70 91 39 Use 
T e x» 20 — 89 A Drop 
Scientific 7 Understanding 
" o: relation- $0 93 da. die 
Spatial visuali á 10 .92 .06 — Study further 
— Fe l0 $0 — 06 Study further 
10 Length of time m ri p" ead 
2 ey am 15 95 .09 Study further 
T 13 Matching parts 10 79 .03 Drop 
igh-school grade 39 


Source: M. H. Jones and H. W. Case, 1955.

The development of a professional aptitude test is illustrated by a report
from the University of California regarding a test for the selection and guidance
of applicants to the Engineering School (M. H. Jones and H. W. Case,
1955). Four sections were developed: a general ability test, a mathematics
test, a test requiring interpretation of scientific data, and a spatial test. Each
section had several parts, adjusted in length according to their presumed
importance. Table 44 shows reliability estimates on limited samples, and
validities based on 583 engineering freshmen. Test 2 was dropped without
complete trial, because of low reliability.

The previous (high-school) grade average is the best predictor; this is a
common finding in academic prediction. Such a predictor is to some degree
unsatisfactory because grading standards vary in different schools, and test
scores are a valuable supplement. The best tests are 5 (quantitative inference),
4 (technical vocabulary), and 7 (scientific interpretation). These
tests, combined with high-school average, correlate .51 with freshman grades.
Note that these tests are all measures of achievement. The prediction from
this three hours of testing would not be bettered by adding the tests with
lower validities, so tests 6, 10, 11, and 13 were dropped. Tests 1, 8, 9, and 12
and a revised form of 3 were retained because it was thought that this
information might be useful in predicting advanced courses.
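
Why a test with low validity adds little to a battery can be illustrated with the standard formula for the multiple correlation of two predictors with a criterion. The sketch below is an illustration only: the function name `multiple_r` and every coefficient in it are hypothetical values chosen for the example, not figures from the California study.

```python
from math import sqrt


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validity of each predictor against the criterion.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)


# Hypothetical coefficients, assumed only for illustration.
grades_validity = 0.40      # assumed validity of the first predictor
predictor_overlap = 0.30    # assumed correlation between the two predictors

with_strong_test = multiple_r(grades_validity, 0.39, predictor_overlap)
with_weak_test = multiple_r(grades_validity, 0.06, predictor_overlap)

print(round(with_strong_test, 3))  # combined validity with a second good predictor
print(round(with_weak_test, 3))    # combined validity with a second weak predictor
```

Under these assumed values the second good predictor raises the combined correlation well above .40, while the weak one leaves it almost unchanged, which is the reasoning behind dropping tests with low validities.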
The California data do not tell clearly whether the special battery predicts
better than a general predictor test would. For evidence on this we
turn to a report from the University of Utah (Pierson and Jex, 1951). The
Utah College of Engineering had used the Pre-Engineering Inventory, a six-hour
battery (which has subsequently been shortened) covering mathematics and
scientific comprehension. It was found that high-school marks predicted
college engineering marks with an r of .57. Adding subtests of the
Pre-Engineering Inventory raised the correlation to about .68. The
nonspecialized Coöperative Achievement Tests, combined with high-school
average, correlated about .65 with engineering grades. (Just why the
correlation at Utah is so much higher than at California is uncertain. One
possibility is that the California engineering students are more severely
preselected.)

This result is consistent with studies of other tests. A specialized professional
aptitude test is not a


à ictive 
this rule are dentistry, where spatial and dexterity tests have predic 


discussed above. The main reason for having ae 

administrative, Such tests are ordinarily en 
through the national pro. association or some comparable grouP ues- 
unselors. This protects the secrecy of rail 
irly as a basis for admission to profess 


16. Why do measures of past achievement predict college marks better than
measures of general mental ability? In view of this fact, what function can group
mental tests perform in college admission and counseling?
17. How could a high school coach students desiring to enter Engineering School
at the University of California so as to raise their aptitude test scores? Is it
desirable to use tests which allow such coaching?


Suggested Readings 


Fleishman, Edwin A. Psychomotor selection tests: research and application in the
United States Air Force. Personnel Psychol., 1956, 9, 449-468.
This is a description of the Air Force psychomotor tests and a summary of
research on factors underlying them.
Schultz, Harold A. Review of the Meier Art Tests: I. Art Judgment. In O. K. Buros
(ed.), The fourth mental measurements yearbook. Highland Park, N.J.: Gryphon
Press, 1953. Pp. 338-340.
In a critique of the Meier test, largely from the point of view of content
validity of the items, Schultz indicates how complex, and how little understood, is
even this one aspect of artistic ability.
Traxler, Arthur E., & others. Validation of professional aptitude batteries.
Proceedings, 1950 Invitational Conference on Testing Problems. Princeton:
Educational Testing Service, 1951. Pp. 18-54.
A symposium describes efforts to develop special tests for accounting, law,
dentistry, and medicine.


12 


Personnel Selection and Classification 


WHEN we wish to 
could look in a test ca 
and begin using that test f 
quired to establish a selection progr 


intended to predict a p 


a test made for quite another purpose. More than one test may be required
to cover all the aptitudes a particular job demands. Another problem is that
jobs are difficult to classify. Some mechanical jobs seem to make
demands which almost anyone can satisfy, whereas success in other jobs with
similar titles depends almost entirely upon the psychomotor factors. No pel 
een developed and how thoroughly its sicat oi 
WS how wel] it will predict in a particular prac 


T educational admissions officer can a 
accept a test solely on the basis of research out 
re. Sooner or later, nearly every test worker must M are 
determine whether his prediction methoc mple 
tester may limit his studies to relatively I s 
now the full procedure for validation rese? 

Sic logic of any study of prediction. z 
hed among various types of decisions for 


qM 
z i erifica 
tests are > Classification, evaluation of treatments, and v 
PROCEDURES IN PREDICTION RESEARCH


To predict success in a job one chooses a number of tests for tryout,
determines their effectiveness experimentally, and devises a plan for using test
scores in making decisions. One procedure relies on crude trial and error:
the experimenter assembles a “shotgun” battery of all kinds of tests in the
hope that one or more of them will prove effective. This method is declining
as we understand better why some tests are valid and others are not.
Psychologists developing test batteries today devote considerable thought to the
characteristics of the job and the establishment of adequate criteria, as well
as to the search for promising tests.

The stages in prediction research are as follows: 


1. Job analysis, to determine what characteristics appear to make for
success or failure.
2. Choice of possibly useful tests to measure these characteristics.
3. Administration of tests to an experimental group of workers.
4. Collection of criterion data showing how the experimental group of
workers succeeded on the job.
5. Analysis of the relation between test score and success on the job, and
installation of most effective selection plan.
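
The analysis stage, relating test scores to success on the job, usually comes down to computing a validity coefficient. The sketch below assumes the coefficient is an ordinary product-moment correlation; the helper name `pearson_r` and the eight pairs of scores are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(var_x * var_y)


# Hypothetical data for an experimental group of eight workers:
# a tryout test score and a later supervisor rating (1 to 5) for each.
test_scores = [12, 15, 9, 20, 17, 11, 14, 18]
job_ratings = [3, 4, 2, 5, 4, 3, 3, 5]

validity = pearson_r(test_scores, job_ratings)
print(round(validity, 2))
```

A real study would use far more than eight workers; a coefficient from so small a group is too unstable to justify installing a test.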


Job Analysis 
The first step is to analyze the job to be predicted. This analysis sets up
hypotheses stating which abilities and habits contribute to or limit success in
the job. No machine-like procedure of checking off one by one all possible
factors has ever been found successful. Instead, the psychologist studies the
task with whatever insight and psychological knowledge he can muster. Job
analysis is in large measure an art.


In order to make a successful analysis, one must first of all have wide
background in psychology. Understanding of motivation, motor habits and the
organization of abilities, and knowledge of the multitude of tests now available
are required. Detailed motion analysis will suggest what dexterities or
coördinations are important. Analysis of the stimuli to which the worker
responds may suggest need for certain perceptual or sensory abilities. One
frequent approach is to compare good and poor employees. Simple studies
often reveal essential differences between good and bad performers. Study
of workers in training is helpful, since their difficulties in learning may show
what aptitude is needed to avoid failure. Research on prediction for other
jobs draws attention to tests worth trying and sometimes suggests that certain
tests can be eliminated without further trial. No routine or stereotyped
approach is likely to be successful, however. The analyst must take off from
the experience of others, but unless he brings in new hypotheses he is
unlikely to find a better method of predicting.

A job analysis should be highly specific. One should not state that success- 
ful workers have "mechanical ability"; one should instead define the ability 
as "knowledge of and ability to apply principles of gears," or "speed in rou- 
tine two-handed manipulation, not involving much finger dexterity or think- 
ing." Such clear definitions permit one to obtain or construct the most appro- 
priate test. The analysis should not be confined to "aptitudes." It should 
range over the entire field of abilities, habits, personality characteristics and 
interests, previous experiences, knowledge, physique, and so on. 


Observers. A more systematic procedure for collecting opinions from a larger
body of informants and reducing the effect of folklore on the job analysis is
the “critical-incident” technique developed by Flanagan (1954). The analyst
asks a foreman or some other person well acquainted with the job to think of
an individual who has done excellently on the job, and then to recall one
particular incident which showed this person's superiority. Likewise, the
informant recalls a poor performer, perhaps one who had to be discharged, and the
incident which led to the final verdict of unsuitability. These incidents are
concrete, and only one stage removed from field observation of good and
poor performance, as can be seen in these two examples (Preston, 1948):


up to land on runway 9. He was told to go around ge 
again. This time he overshot and had to go arou 


is ac- 

prinway 15 spotted and he said “Roger.” After ^in 

y runway 15 and almost "spun-in" trying to ih hat 
to go ahead and land. He came in 


E : : 
*5 a pilot for general officers this lieutenant has ae? 
ent upon himself th 


P " n- 
d Major General on board. Immediately upon being C? 


> ; ; me oF 
any other superbas rather crusty old bird—he, without calling on 


: oron in time 
; arranged for his departure to the original destination 1n 
inal aircraft. 


Purposes of prediction. 
method col] 
ant is likely to bring to mind incidents which support the stamina theory,
and to forget the cases in which drivers made themselves valuable by
recognizing mixups in their orders. The person who classifies incidents likewise
can introduce stereotypes into the final result, but this disadvantage is
present in any judgment of job requirements.


1. Prepare a list of the factors composing aptitude for one of the following jobs:
making pie dough, operating a calculator of a particular type, driving a taxi, or
schoolteaching of some one type.
2. For many jobs requiring long training, e.g., physiotherapy, it is undesirable to
take girls into training who will probably marry and drop out. What
characteristics might distinguish between probable marriers and nonmarriers?


Choice of Tests for Tryout 


Having a list of characteristics presumed to be important in a job, the
investigator must then find tests to measure each. He must make a choice
between seeking one test which is a composite of the job requirements and
seeking a group of tests, each of which is a pure and independent measure of
one of the characteristics. The former method, which usually leads to tests
of the job-replica type, requires the investigator to design a new test for the
job. As we have seen in considering mechanical, psychomotor, artistic, and

mplex test which comes close to the re-
higher validity coefficients than simple
Ily useful when a number of them
b replica has distinct disadvan-


idance it is more economical to use a 
jobs than to use a separate test 


for each job. 
Work samples must be revised, restandardized, and revalidated when
any change in the nature of the job is made. A battery of simple tests can be
revised to fit minor changes in the job by altering weights assigned to the
tests or by adding perhaps one more test.

Assuming that the investigator decides to use many tests, each for a particular
function, he must then choose among available tests or construct new
ones. If the abilities the job seems to demand are already measured in
published tests, such tests should be tried. Naturally, not every test with a
relevant name will be suitable; the investigator must consider the difficulty of
the test, its appropriateness to the intelligence and education of his subjects,
and the like. If the job calls for an ability only approximately represented in
available aptitude tests, it is more desirable to make a new test to measure
this ability than to obtain a pale distorted image of it from an indirect
measure. Without condemning the useful TMC, we can use it to illustrate this
point. The items measure some general factor, but there are also group and
unique factors among its items, which are drawn from all portions of physics
and mechanics. To select men for advanced electrical training, background
and comprehension are significant; but it is probable that Bennett's items
dealing with electricity will give better correlations than his items on forces,
motion, and buoyancy. Inclusion of the latter items might, in fact, dilute
the test so that it will fail to select good workers, whereas a test covering only
electrical comprehension might be a good predictor. Unless there is a close
psychological correspondence between an available test and the job, a new
test of the ability must be constructed.
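
The dilution argument can be made concrete with the formula for the validity of an equally weighted sum of two standardized part scores. The sketch is illustrative only: the function name `composite_validity` and all three coefficients are hypothetical values, not data on the TMC or any other test.

```python
from math import sqrt


def composite_validity(r_relevant, r_irrelevant, r_between):
    """Validity of the unweighted sum of two standardized part scores.

    r_relevant, r_irrelevant: validity of each part against the criterion.
    r_between: correlation between the two parts.
    """
    return (r_relevant + r_irrelevant) / sqrt(2 + 2 * r_between)


# Hypothetical figures: items relevant to the job correlate .50 with success,
# the remaining items only .10, and the two sets of items correlate .30.
relevant_only = 0.50
diluted = composite_validity(relevant_only, 0.10, 0.30)

print(round(diluted, 3))  # validity of the mixed test under these assumptions
```

With these assumed values the mixed test is a noticeably poorer predictor than the relevant items alone, which is the sense in which irrelevant content dilutes a test.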

There is no simple answer to the question of published versus homemade
tests. Tests validated on many jobs are of distinct advantage in educational
guidance, and to a lesser extent in employment work. Counselors would
prefer to predict all jobs with a few tests. But the test designed for a specific job
often has significantly higher validity than the test for general use.

In addition to tests designed to yield predetermined Scores, many selec 
tion studies employ tests which are no more than collections of heteroge 


of these is the biographical inventory, Lan 
capable of predicting success. There is i 
y of this type, but it isa simple matter to pr 

items covering work and educational "an 
» Social activities, and family. In this miscellan * 
ach item is treated as a Separate test, and its relation to SU 
cess on the job is examined empirically, Each job will correlate with some ? 
the items, and a score based on just those items can be used to predict D 


r ng, O! 
ame technique of trying out a mixed assemblage 


Si «od to in- 

criterion can be applied pre- 
: . n 

scored in this manner have often P!” 


ailures in the first group studie: : 
pes of tests studied, the only ones yielding at 
© à pencil-paper test of knowledge about aoe 
> instrument comprehension (.32). and mechanical pr? 


; o 
Guilford, 1947). Two of these three tests clearly depe”! 
perience, 


cients higher than 30 wer 
bile driving (.82) 


Experimental Trial 


The crucial step in prediction research is experimental trial of the
instruments. One gives the tests to typical applicants and observes the
correspondence of test scores to success. In practical work there is much pressure to
omit the experimental study; this pressure must be resisted. When the
psychologist reports to his boss that he believes test X will eliminate poorer
employees, the boss is far more anxious to install the test and benefit from it at
once than to withhold judgment during weeks or even years of investigation.
Full experimental trial is indispensable. No hypothesis can be trusted,
because there have been many instances in which "likely" tests proved to be of
no value in selection.

The nonpsychologist may propose to use the test to eliminate poor men,
and to study the survivors to determine the relation between test and
performance. This is not a satisfactory plan. A test might not predict which of
the acceptable men would do well on the job even though it could weed out
failures. (Example: A hearing test would rule out some people as music
students; but within a selected group, all of whom could hear, it would not
predict success.) Trial on an unselected group is necessary, moreover, to
establish critical scores and weightings of tests. So important is experimental trial
on an unselected population that the Air Force went to the trouble to
validate its selection methods by sending through training 1800 men, a random
sample of all eligible recruits, even though it knew in advance that the
majority of these men would be failures (DuBois, 1947).
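
The cost of studying only the survivors can be shown with a small simulation. Everything here is an assumption made for illustration: the scores are artificial normal deviates, the population validity of .50 and the rule of keeping the top 30 per cent are arbitrary, and `pearson_r` is a helper written for this sketch.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(var_x * var_y)


random.seed(0)            # fixed seed so the run is repeatable
assumed_validity = 0.5    # hypothetical validity in the unselected population
n_applicants = 5000

# Artificial test scores and a criterion that correlates about .5 with them.
tests = [random.gauss(0, 1) for _ in range(n_applicants)]
criterion = [
    assumed_validity * t + sqrt(1 - assumed_validity ** 2) * random.gauss(0, 1)
    for t in tests
]

full_r = pearson_r(tests, criterion)

# Keep only applicants at or above the 70th percentile on the test.
cutoff = sorted(tests)[int(0.7 * n_applicants)]
survivors = [(t, c) for t, c in zip(tests, criterion) if t >= cutoff]
restricted_r = pearson_r([t for t, _ in survivors], [c for _, c in survivors])

print(round(full_r, 2), round(restricted_r, 2))
```

In the selected group the correlation falls well below its value in the unselected group, so a validity coefficient computed on survivors understates how well the test works for applicants at large.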

Subjects should take the tests with the same motivation that would exist in
their ultimate use (see p. 53). The investigator will try more tests than he
can use in his final prediction battery, since some will probably not be
helpful. This makes the trial battery long, and special attention must be paid to
maintaining coöperation from the subjects. Sometimes one test can be tried
at a time, but sooner or later the entire selection battery must be validated
on a single group.
4. Suppose an employer puts a test in use without tryout. What harm can result 
from this, assuming that the validity of the test is zero or low positive? 


The Criterion 


After giving his tests, the experimenter waits for evidence of good and
poor job performance. The experimental group is treated in the same way
as other workers, being given normal training and duties. After an
interval, data on success are obtained. Among the criteria often used are
quantity of production, quality of production, turnover, and opinions of foremen
or supervisors. As was explained on p. 108, it is important that the criterion
possess a high degree of validity. A test which predicts quality of work
will seem to be a poor test if it is judged by a criterion which does not fairly
indicate quality of work. The criterion (or set of criteria) should cover all
important aspects of the job.


Criteria may be based on measured output, field observations, or ratings. 




rva- 

The criterion must have high reliability. An adequate number Es i inia 
tions representative of normal performance is required. Bu pro stability, 
is the criterion, it must meet the usual requirements of pprap a 
and validity. Ratings are particularly common criteria e ios P ad hn 
many errors. Methods of making ratings more dependable are 

later chapter - 506 ff.). . if- 
: lunes i r measure of success is suitable. One eara "4 — 
ferent workers seem best at different times. The fast learner w no wit a 
at the end of a short training course may not give as good imei : mars 
à learner who continues to gain in ability after he starts work. = lid 
who makes good grades in training sometimes lacks temperamental q 
for success on the job. reas eae est 

In every study, there is some hypothetical “ultimate criterion” which
represents what the selector desires to obtain. The medical school would, if
it could, judge the success of its students by their lifetime contribution to the
community where they practice. This probably depends more on personality
attributes than on abilities; it certainly is not very closely related to grades in
biochemistry. The student's grades, however, are likely to be the criterion in
any selection research done by the medical school; they are available,
ent who never graduates will make A ap js tho 
sive effort to study an ultimate rra " 
in Korea. Teams of observers and gees 

at to obtain information on portoen ie i 

data were supplemented by ratings from field commanders. The validity of
an Army test battery developed to predict performance in training and in
maneuvers was .27 against these peacetime criteria, but only .17 against the
combat criterion. A battery developed using the combat criterion correlated
.36 with both training and combat criteria. The important difference
between the two batteries was the inclusion of a personality questionnaire in
the combat-valid battery (Willemin et al., 1958).

More and more attention is being given to establishing plural cri 


rS 
at cause failure among poor su 
lare necessary types. It is wes 
" 1 T 
te for comparing these diffe 


ossible 


type? 
to find a single cri t typ 


terion that is adequa 
of teaching succe 


SS. 


A striking study by Lennon and Baxter (1945) inverts the usual procedure
and determines what aspects of the criterion can be predicted by
available tests. Each item on a ninety-item checklist applied to clerical workers
was correlated against a revision of Army Alpha and an aptitude test requiring
alphabetizing, number checking, coding, digit counting, computation,
and reading of tables. For some aspects of job performance, correlations with
the predictors were high, but other qualities were not predicted. Checklist
items were well predicted which dealt with understanding of the work,
quantity and speed of work, performance of multiple tasks, and avoidance
of duplication of effort. Faults in quality of work, typing, shorthand,
grammar, orderliness, and “personality” were not predicted successfully. Some of
the results are shown in Table 45. This study shows why it is difficult to
predict such a composite criterion as “supervisor's rating of all-round
performance.”


TABLE 45, Percentage of Office Workers Having High and Low Aptitude Scores 
Rated as Having Particular Characteristics 


Learning Ability Test Clerical Aptitude Test 


High 27% Low 27% | High27% Low 27% 
Characteristic (N = 58) (N = 58) | (N = 58) (N = 58) 


Als Working instructions have to be re- 


Peated frequently 7 12 5 5 
Nanded helpful suggestions about work 8i E bs x 
ten 
emn me o. mw 
easing it rk for errors before re i " be 3 
geo. forgets matters which should " à 3 : 
si Prompt attention : : 4 7 


incli z 
"lined to sacrifice accuracy for speed 


Boldface type indicates that the difference between low and high group is
probably a true difference, rather than the result of chance in sampling.
Source: Lennon and Baxter, 1945.
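
The comparison reported in Table 45 contrasts the highest and lowest 27 per cent of workers on an aptitude test. A minimal sketch of that tabulation follows; the function name `extreme_group_percentages`, the fifteen scores, and the checklist flags are hypothetical and are not data from Lennon and Baxter.

```python
def extreme_group_percentages(scores, has_trait, fraction=0.27):
    """Percentage showing a checklist trait in the top and bottom score groups.

    Returns (percent in high group, percent in low group).
    """
    ranked = sorted(zip(scores, has_trait), key=lambda pair: pair[0])
    group_size = max(1, round(len(ranked) * fraction))
    low_group = ranked[:group_size]
    high_group = ranked[-group_size:]

    def percent_with_trait(group):
        return 100.0 * sum(1 for _, flagged in group if flagged) / len(group)

    return percent_with_trait(high_group), percent_with_trait(low_group)


# Hypothetical aptitude scores for fifteen clerical workers, and whether the
# supervisor checked "instructions have to be repeated frequently" for each.
aptitude = [55, 62, 48, 71, 66, 59, 80, 45, 77, 68, 52, 74, 60, 83, 50]
repeated = [True, False, True, False, False, False, True, True, False, True,
            True, False, False, False, False]

high_pct, low_pct = extreme_group_percentages(aptitude, repeated)
print(high_pct, low_pct)
```

A large gap between the two percentages suggests that the test predicts that particular aspect of performance; whether a gap of a given size could arise by chance depends on the size of the groups.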


When records have been collected to show which workers are most
successful, the final procedure in a selection study is to process the data and
identify the best predictors. Before discussing the analysis of prediction data,
it will be desirable to see the entire research process by examining an actual
study.


5. What procedure might be suggested for selecting clerical workers, in view of the
findings of Lennon and Baxter?
6. List several independent (nonduplicating) criteria which might be used to
evaluate teacher success.
7. List several independent criteria to consider in judging branch managers of an
equipment firm. Branches are responsible for both sales and service.
8. McNemar (1952) makes the following comment about a study of performance in
clinical psychology: “It is sheer nonsense to have proceeded with an extensive
testing and assessment prediction program without first having devised
satisfactory measures of that which was to be predicted.” Yet the study was conducted
by well-qualified and experienced persons and supported by equally
responsible psychologists in the Veterans Administration. What
arguments can be given on each side of this controversy?


Development of a Stenographic Aptitude Test 


Among girls who undertake the study of shorthand, quite a few fail. A test
which could be given before training would save time and expense in
stenography courses and permit girls to select a more appropriate course.
Moreover, tests of general intelligence and the commonly used aptitude tests
have had only moderate success in predicting failures in stenography. It was
therefore decided to develop a test geared specifically to the problem of
predicting success in learning stenography. The abilities to be built into the test
would be those important in stenography, even if they were of no
significance anywhere else in business or education.

A job analysis was made. It was based upon a study of the shorthand
systems and the nature of the job, rather than upon observation of
stenographers. The resulting list of abilities was as follows (Deemer):
During dictation: The more efficient stenographer will probably be superior to
the less efficient in:

1. Ability to listen to what is being said during dictation, i.e., facility with which
she attaches meaning to each word dictated.
2. Ability to write correct outlines fluently and rapidly.
3. Ability to hold a number of words in mind while writing others.
4. Ability to be “behind the dictator” without becoming flurried.
5. Knowledge of symbols for complete words. The less efficient stenographer will
have to compose more outlines sound by sound during dictation.
6. Thoroughness in checking, during pauses in dictation, the outlines just
written.

During transcription: The more efficient stenographer will probably be superior
to the less efficient in:

1. Ability to judge from the length of her notes where to begin the letter on the
page.
2. Ability 
8. Ability to re: 
a. To call u 


are neat and clean. 
has written. oniZ- 
s for which an outline stands, either by eae 
r by deciphering the outline sound by 50 
> the word that fits the context. 
Ability to spell the words.
Ability to type the words accurately and rapidly.
Ability to judge how far ahead to read before beginning to type.
This list of abilities was shortened by eliminating aptitudes which all girls
might be expected to possess to an adequate degree, by eliminating those
abilities which would be developed in training, and by combining some
abilities. Preliminary forms of the test were then designed, using the following
exercises:¹


Speed of writing. Girls were required to copy the Gettysburg Address
as rapidly as possible. This may be considered a coördination or rate of
movement test, duplicating many motor elements of shorthand writing.

Word discrimination. In this test, girls choose which of two words bes : s a pa 
ticular context, as in “We are satisfied that our (personal perc. P com- 
pletely loyal to the firm." This simulates the problem in E ene of cl pasing 
the correct word when an outline fits more than one wor. T en ex in 
tellectual action involving verbal por rigo m - e m din. 

i i i rri tly spelled, 

e FRG, Gud o T oin Th simulates the problem in short- 
netically by oshen, akshn, vejtabl, bleef, etc. This : gut eei 
hand of recalling the entire word from a phonetic symbol, and 2 a 
tests spelling. 

Vocabulary. , d 

Sentence hin, The tester reads aloud sentences e ies d 
subjects take down. The sentences increase In length, so tha j 


eventually carry many words in mind. 

as to try the preliminary forms of i m 
Some items proved ambiguous, too easy, or too hard x Lp ied i 
final form of the test for validation was then pipas : rr der — 
mined by administering the test to 500 students E g eis ror 
During the next two years, various measures of achieveme k 


For the total test score, the validity coe 
Criteria: 


The next stage in the study wi 


fficients were as follows, for different 


ictati less .54 

tudy: dictation at 60 w.p.m. or 
: p iod yr dictation at 80 w.p.m. or less .65 
5 udy: dictation at more than 80 w.p.m. Z0 


0 w.p.m.) .58 


“curacy of transcription after two years s dd 65 
R two weeks after dictation (90 w.p.m. or Hs a 
ate of transcription at end of two years © 


udy 
; justify using the test to identify 

These validi i high enough to justity 

idity coefficients are hig J I 

girls likely have difficulty in the course. a ide benaos hos eye 

that many false predictions will be made in individua 2 

usually prefer to use the test to point o 

tention from the teacher rather than ar 


from trying shorthand. 


ut those who should have special at- 
bitrarily to bar girls with low scores 


i t represented in the final test? 
9. wW dere fiii job analysis are noi i 
" Drei Spei pel reliability coefficients are reported for this test be- 
i manua 7 


idi fficients. If the 

i the reported validity coefficients 
Cause it i dd nothing to the T : : 3 
ved, i teas hp Ape the reliability coefficient must be satisfactory. 


Items copyright 1944 by Walter L. Deemer and reproduced by permission of Science Research Associates.

Although the latter sentence is defensible, one sometimes wishes reliability
data also. What important questions about Deemer's test would be answered
by reliability coefficients for subtests and total score?

11. What explanations can be offered for the failure of the validity coefficient to
reach 1.00? What does this imply regarding ways to improve the prediction in
further studies?


12. The DAT battery, developed later than Deemer's test, includes measures of
spelling and verbal ability. In a sample of 43 girls, the validity coefficients for
predicting shorthand grades are: VR, .45, Spelling, .68, Sentences, .58. Does
Deemer's test appear worth using when DAT results are available? If neither
test has been given, which should a school adopt to reduce the number of
failures in shorthand?


DRAWING CONCLUSIONS FROM SELECTION TESTS 
Strategy of Decision Making 


The test scores, once obtained, are translated into decisions according to
some plan. The plan or strategy describes how scores from various tests are
to be combined, how they are to be combined with nontest information, and
what decision will be made for any given combination of facts.
For the moment, we sh 


^ All persons below this score are rejected. " 
mined from the scatter diagram or mes 

table. The validation data indicate what degree of success is to be cae 

evel. The decision maker decides what leve 


ged from entering the School of Engineering. to 
s RE gy borrowed from medical diagnosis is app 
decisions involving two 


: “fail” 0 
definite categories, such as “succeed” and “fail”;

FIG. 59. Scatter diagram for the ACE test as a predictor of engineering grades (Sessions, 1955).

by the test (the “negatives”), the persons who should have been identified
as positive are called “misses.” No special name is given to the true negatives,
who are ordinarily much the largest fraction of the sample. In the engineering
study, there were 16 positives out of 147. Of the 16 positives, 14 were hits
and 2 were false positives. The test let 32 misses slip into the engineering
school.
It is generally unwise to set a cutting score directly from the raw data in
the scatter diagram. In the engineering data, it looks as if there is a marked
difference between persons scoring 85-89 and those scoring 80-84. Three-fourths
of one group pass, whereas none of the others pass. This abrupt decline
is almost certainly due to the fact that only a limited sample was studied.
With more cases, there would be more misses in the 85-89 column and
some false positives in the 80-84 column. Figure 60 permits us to estimate
what will happen in a large population. The dots in that figure are the
proportions of failure in each five-point interval. The line fitted to the points
gives an estimate of the trend of failure in the population of which these 131
cases are a sample, i.e., of the trend to be expected in other samples. Estimating
from this line, the failure rate at 85 (the cutting score originally proposed)
is about 62 percent.

Setting a cutting score requires a value judgment. If we accept a cutting
score of 85, it means we wish to reject persons who have less than an even
chance of passing. One college administration might decide that it cannot
afford to admit boys unless they have a 70 percent chance for survival; if so,
the cutting score should be 105. Another administration might leave the
choice to the student except where there is a very high probability of failure.
For this purpose a cutting score of 75 might be set to rule out those who
have a 4-to-1 chance of failing. The administrator who lowers the cutting
score reduces his false positive rate; that is, he runs less risk of rejecting a
satisfactory student. At the same time, however, he increases his number of
misses.

FIG. 60. Probability of success in engineering as a function of test score. (Horizontal axis: ACE score.)
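The bookkeeping behind hits, false positives, and misses can be sketched in a few lines. This is only an illustration: the records are invented (they are not the engineering data), the function names are hypothetical, and the equal losses assigned to the two kinds of error are an assumption the decision maker would have to replace with his own value judgment.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    hits: int             # below the cutting score, and actually failed
    false_positives: int  # below the cutting score, but actually passed
    misses: int           # at or above the cutting score, but failed
    true_negatives: int   # at or above the cutting score, and passed


def tally_decisions(records, cutting_score):
    """Count outcomes for (test_score, passed) pairs.

    Persons scoring below cutting_score are the test's "positives",
    i.e., the predicted failures who would be rejected.
    """
    t = Tally(0, 0, 0, 0)
    for score, passed in records:
        if score < cutting_score:
            if passed:
                t.false_positives += 1
            else:
                t.hits += 1
        elif passed:
            t.true_negatives += 1
        else:
            t.misses += 1
    return t


def expected_loss(tally, loss_per_miss, loss_per_false_positive):
    """Total loss under assumed (hypothetical) costs for each kind of error."""
    return (tally.misses * loss_per_miss
            + tally.false_positives * loss_per_false_positive)


# Invented records for illustration only: (test score, passed the course?)
records = [(70, False), (78, False), (82, False), (84, True),
           (86, False), (88, True), (92, True), (97, False),
           (103, True), (110, True)]

for cut in (75, 85, 105):
    t = tally_decisions(records, cut)
    print(cut, t, expected_loss(t, 1.0, 1.0))
```

Running the loop with different loss values shows the tradeoff described in the text: a lower cutting score trades false positives for misses, and the preferred score shifts as one error is judged more costly than the other.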
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a total loss. Th 
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ail) are these: leal from a 
e student will gain a good dea me 0 
Ps out. If admitted he will beco 
hatever he learns. . educt 
> he may be a total loss to A jencies 
» further investigation can perhaps identify de “ance 
à plan in which he has a greater 
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continue testing him by means of his class performance. There is no way to 
continue testing a boy who is rejected. Erroneous decisions to reject cannot 
be corrected. 

On the other side, the arguments for a high critical score include these:

• Accepting a boy who is unlikely to succeed wastes educational resources.
He takes staff time which might better be spent on more promising
students. His presence in the group lowers the level of discussion and thus
robs the better students.

• The boy who is going to fail is better off facing the fact at once, rather
than after he has wasted a year. He can use the year to get started in a more
suitable trade or course of study.

In general the problem is to weigh the loss from accepting a failure
against the loss from rejecting a person who will make good. The proper
cutting score is one at which these two risks are in balance.

The examples discussed above assume that performance increases as test
score increases. When turnover is used as a criterion, a different type of relation
is at times encountered. Turnover sometimes is found to be relatively
great for men with very high and very low aptitude, whereas men in the middle
range tend to stay on the job. In a study of taxi drivers, seven out of ten
tests of discrimination, motor speed, and reasoning showed such a relation
to the criterion. Data for two of the tests are plotted in Figure 61. It is reasonable
to suppose that the poorest men drop out because of difficulty on the
job, while the best men are able to move to some more satisfying or better-paid
job. For a situation like this, a double cutoff plan might be necessary to
eliminate men at either extreme of the ability scale.

FIG. 61. Curvilinear relations between predictors and turnover of taxi drivers (C. W. Brown and E. E. Ghiselli, 1953). Panels: Complex Arithmetic Test and Speed of Reaction Test, each with score categories 1-3, 4, 5, 6, 7, 8-9 on the horizontal axis. Vertical axis: percentage of employees remaining after three months.
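A double cutoff amounts to accepting only the middle band of scores. The sketch below assumes scores on a 1-to-9 category scale like the one used in the taxi-driver figure; the limits 4 and 7 and the applicant scores are hypothetical, chosen only to show the mechanics.

```python
def double_cutoff(score, lower, upper):
    """Accept only scores inside the band [lower, upper], inclusive."""
    return lower <= score <= upper


# Hypothetical applicants and category scores (1 = lowest, 9 = highest)
applicants = {"A": 2, "B": 5, "C": 6, "D": 9}

# Hypothetical limits: reject the lowest and the highest scorers alike
accepted = [name for name, s in applicants.items() if double_cutoff(s, 4, 7)]
print(accepted)
```

In practice the two limits would be read from a curve such as those in Figure 61, at the points where the proportion of men remaining on the job falls below what the employer will tolerate.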
Expectancies and critical scores may change rapidly in a period when
institutions are changing. One striking example of such change is in the
undergraduate grades of students entering medical school, summarized in Table 46.

TABLE 46. Caliber of First-Year Medical Students in Successive Years

                Percentage Having an Undergraduate Grade Average of
                      A         B         C
1950-1951            40        43        17
1951-1952            30        55        15
1952-1953            18        68        14
1953-1954            21        69        10
1954-1955            17        69        14
1955-1956            16        71        14

Source: Anon., 1956.

The 1955 admissions are strikingly poorer than those in 1950. The schools evidently
were able to hold the same critical score for admission, but they attracted
fewer "A" applicants. For the college counselor and his client considering
medical school, the change is of great importance. An undergraduate with a
B+ average would have been only an average medical student in 1950, but
in 1955 he would have been near the 75th percentile.


(at present unexpl 
reduce their dema 


candidates, 


13. In Figures 59 and 60, what cutting score would be used to eliminate students
with one chance in three of failing?

14. What assumptions

15. tal

16. A large offi

17. Which is the more serious, false positives or misses, in each of these situations?

a. Patients entering a hospital are given a reasoning test which gives an
indication of organic brain damage. Positives are given a thorough neurological
examination.

b. Candidates for admission to teacher training are screened for ability.

c. It is important to hire skilled sheet-metal workers to fill vacancies, during a
time of tight labor supply. Men cannot be trained on the job.

d. In inducting soldiers, a mental test is used to determine which men are too
dull to be useful to the service.

e. A company wishes to hire mechanics and put them through an expensive
training program; success cannot be observed until the end of the course.


18. The following is taken from a letter to the New York Times:

"I submit that 'slaughter on the highway' will continue until state licensing
authorities recognize some simple facts: To drive a car on today's highways
demands a rather complex set of sensori-motor skills. These skills are 'normally'
distributed; i.e., some folks have them, some do not. Instruments are available
to measure these skills. Authorities have some remote responsibility here to see
that such instruments are used before licensing.”

a. What degree of validity should be required before tests are used as proposed?

b. If scores are normally distributed, how should the cutoff score be fixed?


Combining Data from Several Tests 


When several tests are tried out, the results must be evaluated in order to
decide which test is best, and to determine the most useful combination of
tests for predicting. If only one test is to be used in selection, we will ordinarily
select that which has the highest validity. An exception to this rule occurs
when the best test is quite expensive to apply and some simpler test yields
nearly as good a correlation. A second exception occurs when the tests tried
out have quite unequal reliabilities. If a reliable test has the best validity,
and the runner-up is notably unreliable, the best procedure may be to
lengthen the latter test to increase its validity (cf. p. 130). Prediction is ordinarily
improved by combining several tests which cover different relevant
aptitudes.

Multiple Correlation and Statistical Weighting. It is customary to employ
multiple-correlation techniques to select the most effective combination of


for à 
9r computing multiple correlation are 


To obtain a high multiple correlation, one seeks tests which have positive
correlation with the criterion and low correlations with each other.
There is little value in combining tests of the same ability; this is equivalent
to making the original test more reliable, and usually raises the validity only
slightly. But if a new test measures a component of the job not estimated by
the first test, it will improve the multiple correlation appreciably. The example
in Table 47 shows how prediction improves when we combine several
tests with low validity. It also shows that the multiple correlation approaches a
ceiling very rapidly, so that adding tests beyond the first three or four rarely
is valuable. More elaborate prediction batteries are worth while only when
each added test measures a new factor.

TABLE 47. Effect on Multiple Correlation of Adding Tests to Battery

                                 Correlation with        Multiple Correlation of
                                 Criterion (Shop         Criterion with First Test,
                                 Performance of Junior-  First Two Tests, First
Tests                            High-School Boys)       Three Tests, Etc.
Paper Form Board                       .43                     .43
Stenquist Assembly                     .26                     .44
Steadiness                             .29                     .53
Card sorting                           .27                     .563
Tapping                                .18                     .580
Spatial Relations Formboard            .36                     .594
Packing blocks                         .28                     .595

Source: Paterson et al., 1930, p. 83.
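For the special case of two predictors, the multiple correlation can be written directly in terms of the two validities and the intercorrelation: R² = (r1² + r2² − 2·r1·r2·r12) / (1 − r12²). The sketch below uses this standard two-predictor formula; the validities of .40 and .30 and the two intercorrelations are hypothetical values chosen to illustrate the point about overlap, not figures from Table 47.

```python
import math


def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2: validity of each test (correlation with the criterion)
    r12:    correlation between the two tests (must not be +1 or -1)
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)


# Hypothetical validities; only the overlap between the tests changes.
independent = multiple_r(0.40, 0.30, 0.10)  # second test measures something new
overlapping = multiple_r(0.40, 0.30, 0.70)  # second test largely duplicates the first

print(round(independent, 3), round(overlapping, 3))
```

With little overlap the second test lifts the prediction well above the .40 obtained from the first test alone; with heavy overlap the gain is negligible, which is the pattern the text describes.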
One can afford to discard tests from a trial selection battery even though
they have positive correlations with the criterion. There is little value in
extending a battery by adding even reliable tests, if they duplicate abilities
already measured. The following correlations between tests and elimination
from flight training were found by the Air Force (DuBois, 1947):
Pilot stanine ( 

Stanine plus 

Stanine plus 


ie, composite Score on selection battery) 653 
Qualifying examination ^ 


Qualifying plus General Classification Test 655 t 
The multiple-correlation procedure starts with the test validities and their
intercorrelations. Customarily, one selects the test having the highest validity
as the first member of the composite predictor. Then the intercorrelations
are examined systematically to determine which test predicts the criterion
and at the same time least duplicates the test already chosen. The third test,
in turn, must be one which overlaps little with the first two. Out of the
computations comes a set of weights, which place heavy emphasis on the best
test(s) and smaller emphasis on the tests added later. A cutting score for the
weighted composite is established in the same way as for a single test.
Bo t T eights is illustrated in Table 48, which shows how w “Jier d 
y b Air Force to predict graduation from pilot, hor an D 
aming. The same tests were used for all entering eae P Som 
à was required for each job. In selectins v iy 
imination reaction time and finger : b qu 
ding and arithmetic had very little wel 


and navigator tr 


> Whereas rea 
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TABLE 48. Validity Data and Combining Weights Used in an Air Force 
Classification Program 


Correlation with Criterion Relative Weight 
Test Bomb. Nav. Pilot Bomb. Nav. Pilot 
Printed tests: 
Reading Comprehension 12 .32 al? 8 2 — 
Spatial Orientation Il 09 35 o = * ; 
Spatial Orientation | 412 .38 20 sap 9 6 
Dial and Table Reading 19 is Ve us » : 
Biographical Data—pilot = 54 a = E be 
Biographical Data—navigator — 23 —.03 == 9 -= 
Mechanical Principles .08 A3 .32 = = 8 
Technical Vocabulary—pilot 04 10 a = —= Fe 
Technical Vocabulary—nav. -04 E l = 18 = 
Mathematics 10 .50 .08 — 18 = 
Arithmetic Reasoning Me d e E m" 
Instrument Comprehension ! ih = 1 = B " 
Instrument Comprehension Il aa 26 ‘01 = = oan 
Numerical Operations, front 13 28 ‘02 = x 
Numerical Operations, back Al 19 18 = an = 
Speed of Identification 2 G : 
Apparatus tests: 
Rotary ursul 44 i E. 2 = 7 
Complex Coórdination 8 EU KT 19 6 -— 
Finger Dexterity lg A. 22 27 6 4 
Discrimination Reaction Time 22 EU "30 -= n 4 
Two-Hand Coérdination ao — — 2 
Rudder Control = - 
Note: The criterion for the various validity coefficients is graduation or nongraduation from training.
Source: DuBois, 1947, pp. 99,


Navigator score, on the other hand, depended primarily on these intellectual
abilities.

A combination of tests makes the greatest contribution when the original
battery contains independent tests measuring quite different factors. If the
tests overlap to a large degree, the procedure will eliminate most of the tests
and it is almost certain that some aspects of the criterion will not be measured.
One of the major claims of factor analysis is that it will ultimately permit
the preparation of “pure” tests, each measuring one factor and very little
else. These various factors can then be put together in whatever proportion
a given criterion demands, whereas an impure test puts both wanted and
unwanted factors into the composite. This argument is logical enough, and is
increasingly being attained in such batteries as the GATB. Pure tests have
not been easy to devise, however.

19. The Holzinger-Crowder manual offers weights for estimating relative scores on
certain general mental tests. Interpret this equation:

Estimated CTMM standing = 7.5V + .3S + .9N + 1.3R

20. How do you account for the different weights assigned the first two tests in
Table 48 for navigator prediction, in view of their similar validity coefficients?

21. Which of the three aircrew jobs has the smallest psychomotor component,
according to the prediction weights?


Multiple Critical Scores. Instead of pooling all scores into a composite, 
some personnel workers select their tests either by multiple correlation or by 
other methods and then use a strategy known as the multiple cutoff or multi- 
ple critical score. For each test a separate critical score is determined. The 
critical scores are adjusted so that persons who pass all the hurdles have a 
satisfactory probability of acceptable performance. 

The most extensive use of this plan is in the application of the GATB bat- 
tery. Occupational standards are established by considering the average 
score of successful workers in the job, and also the correlation of the tests 
with the criterion and with each other. An example is the standard for the 
job of “mounter.” A mounter assembles radio-tube mounts and connects very 
small parts and wires, welding them in place. The passing standard for this 
occupation on the older form of the GATB is Form Perception (P) 85, Aim- 
ing 85, Finger Dexterity 90, Manual Dexterity 85. The cutting score elimi- 
nates the lower third of the workers, thereby weeding out the most probable 
failures. 
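The multiple-cutoff rule for the mounter standard can be sketched as a set of independent hurdles. The minimum scores are those quoted above for the older form of the GATB; treating a score equal to the minimum as passing, and the two applicants' score profiles, are assumptions made for the illustration.

```python
# Minimum aptitude scores for "mounter" as quoted in the text
# (P = Form Perception, A = Aiming, F = Finger Dexterity, M = Manual Dexterity).
MOUNTER_STANDARD = {"P": 85, "A": 85, "F": 90, "M": 85}


def passes_all_hurdles(scores, standard):
    """True only if every required aptitude meets its own critical score.

    Assumption: a score exactly at the minimum counts as passing.
    """
    return all(scores[aptitude] >= minimum
               for aptitude, minimum in standard.items())


# Hypothetical applicants
strong_overall = {"P": 100, "A": 90, "F": 95, "M": 85}
weak_fingers = {"P": 130, "A": 120, "F": 89, "M": 125}

print(passes_all_hurdles(strong_overall, MOUNTER_STANDARD))
print(passes_all_hurdles(weak_fingers, MOUNTER_STANDARD))
```

The second applicant is rejected despite very high scores elsewhere, because under this plan strength on one aptitude cannot make up for falling below the hurdle on another.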

The validity data in Table 49, used to establish the standards, were based
on 65 cases. The validity coefficients suggest selecting on scores F, M, A, and
T, but the high correlation of A with T indicates that only one of the pair


TABLE 49. i T . 
s ease Used in Establishing GATB Occupational Standards 


E 


Correlation 
l ith Produc- 
Aptitude Score M s.d. da Records 
S—Genera 106.9 15.3 =o | 
V—Verbal 102.2 147 7001 | 
Mee 105.8 13.3 064 
—Spatia . j —.009 
P—Form Perception 1 Tis i ae 015 
lerical Perception 106.2 159 4097 
A—Aiming i 
.229 
T—Motor Speed Haat 125 191 
Finger Dexterity 109.5 1 8.4 437 
M—Manual Dexterity 987 207 353 
Source: 


Guide to the Use of GATB, 1958, p. 1-1.15, 


need be used (In the iming 
4 more ; Aim 
a cutoff of 85 on K [ Tecent form, which does not measure 


mot R S note 
the additional fact that t ze LU been substibutodL) The VEE p. THe 


ding 
ot 
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correlate with performance within the selected group. For this reason, P was 
added as a hurdle. 

The validity of the composite pattern (F, P, A, M) was tested by a tetra- 
choric correlation calculated from a table of hits, misses, and false positives. 
The correlations for three samples, using production records as a criterion, 
were .46, .49, and .52. The composite is thus a trifle better than prediction 
from F alone would be. 

22. In a fourth sample of mounters, supervisors' ratings were used as a criterion. 
How do you account for the fact that the correlation was only .24? 

Comparison of Composite and Cutoff Procedures. The weighted composite
ranks individuals according to their expected criterion scores, and a single
cutoff is established. All persons whose composite is above that level are accepted.
A graphic illustration is given in the left panel of Figure 62; all persons
above the line are accepted. The multiple cutoff eliminates persons who
are low on either test, with the result diagramed in the right-hand panel of
Figure 62. For most individuals both procedures lead to the same decision.
Persons in areas 1 and 2, in the lower-right and upper-left corners of the
scatter diagram, are low on one test and high on the other. They are rejected
by the critical-score procedure but are accepted when a weighted composite
is used. The persons in area 3 are just above the minimum on both tests; they
are accepted by a multiple-cutoff plan but are rejected when the composite
score is used.

FIG. 62. Two selection plans. Left panel: weighted composite; right panel: multiple cutoff. In each panel Test A is on the horizontal axis and Test B on the vertical axis, running from low to high.

The advantages of the weighted composite are as follows:

• It gives additional information, indicating how each accepted man
ranks within the group. This is useful for identifying men requiring special
assistance during training, or for singling out superior men for special
responsibility.

• By estimating the man's probable success in the assignment it permits
comparison with his probable success in other lines instead of merely ruling
out the assignments where he would fail.

• If there is a linear relation between tests and the criterion (a matter to
be discussed further below) it gives a higher proportion of correct decisions
than the multiple cutoff.

The multiple cutoff has these advantages:

• It is easier to compute, administer, and explain to laymen than is the
composite score.

• Retaining the scores of separate tests provides a valuable basis for
counseling.

• When the relation between tests and criterion is configural rather than
linear, the multiple cutoff can yield a higher proportion of correct decisions
than the customary weighted composite.

The essential difference is that the weighted composite acts on the assumption
of compensation among abilities. A person weak in dexterity may
be accepted if he has exceptional perceptual ability; strength in the one is
presumed to make up for weakness in the other. In most predictions this
assumption is justified, but not in all.
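The disagreement between the two plans can be shown with a small sketch. Everything numeric here is hypothetical: equal weights of 1 for the two tests, a composite cutoff of 200, and a minimum of 85 on each test. The four cases are chosen to fall in the regions discussed in connection with Figure 62.

```python
def composite_accepts(a, b, weight_a=1.0, weight_b=1.0, composite_cutoff=200.0):
    """Weighted composite: a high score on one test can offset a low one."""
    return weight_a * a + weight_b * b >= composite_cutoff


def multiple_cutoff_accepts(a, b, min_a=85.0, min_b=85.0):
    """Multiple cutoff: each test is a separate hurdle; no compensation."""
    return a >= min_a and b >= min_b


# Hypothetical (Test A, Test B) scores
people = {
    "high A, low B": (130, 80),
    "low A, high B": (80, 130),
    "just above both minimums": (86, 86),
    "high on both": (120, 115),
}

decisions = {
    name: (composite_accepts(a, b), multiple_cutoff_accepts(a, b))
    for name, (a, b) in people.items()
}

for name, (by_composite, by_cutoff) in decisions.items():
    print(f"{name}: composite={by_composite}, multiple cutoff={by_cutoff}")
```

The first two cases are accepted only by the composite (compensation at work), the third only by the multiple cutoff, and the fourth by both, which is the usual outcome for most individuals.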


At one time during World War 
ate antisubmarine li 


chologists established a predic 


II the Navy desired to train men to oper 
ar. By the usual correlation procedure, p 
tion formula: men were screened on pm 
rviving Broup, predictions were based on n 
and severa] Seashore tonal tests. Following 
avy procedure, acceptable men were sent to training school, ane 
in school were assigned to general sea duty. It was there 


fore a serious matter when a man of good intelligence was sent to a i 
for which he was unqualified, since his ability would not be properly use 
Many men who failed in son did so because of wary’ pour Des 
judgment, which listening duty. How had they ei 
f ir high mechanical comp pap 
) raised their composite enough a yere 
an adequate "average" ability; i in 
as they would have been €— 
ance, or navigation. In fact, a few were eal , 
£ where they did well. Ultima 
Configural icti 
tidie eos deo. The estimation equation derived by 3 say; © 
timated scores hs eT à cin den i - pe 
a aeai ding predictors. The criterion sc à Jine? 
Sularly as à function of the predictor score. This 


IBS M 
i teri 
A certain amount of manual dexterity may
be required on an assembly line, but increased dexterity above the minimum
makes little or no difference in the value of an employee if the assembly
line moves at a fixed speed. A worker who keeps ahead of the line produces
no more than one who can just comfortably keep up. His greater dexterity
does not compensate for weakness in some other ability.

Many writers have condemned conventional additive prediction methods,
claiming that “patterns” or "configurations" of scores will yield better prediction.
We cannot consider this issue fully here, and confine ourselves to a few
summary statements:

• The writers who propose to interpret "patterns" are often vague in
their proposals, and writers mean different things by that term. Some of the
writers advocate examination of simple differences between abilities which
the multiple-correlation procedure automatically takes into account. Others
are arguing for nothing more complicated than a multiple-cutoff procedure.

• The multiple-correlation method can be extended to take into account
configural relationships of any degree of complexity. It need not be limited
by the linear assumption.

• In practical prediction problems, a linear assumption has almost always
proved adequate to account for the data, and nothing is gained by introducing
configural formulas unless data are uncommonly reliable.

• In a limited number of investigations, configural treatment of scores
has permitted much better prediction than a linear composite. Most of these
investigations have involved personality variables as predictors.

One pioneer, highly tentative study suggests what may be done with configural
prediction. Frederiksen and Melville (1954) thought it possible that
for compulsive students, who work hard on tasks even when they are not
interested, achievement in engineering would be predicted by ability tests and
not by interest tests. Among noncompulsive students, however, who work
hard only on what interests them, they thought an interest test might be an
important additional predictor. The investigators used two indications of
compulsiveness or unusual concern with accuracy: having interests like those
of professional accountants; and having low speed on a reading test relative
to the vocabulary score. Some of the validity coefficients of predictors against
average grades were as follows:


Compulsives Noncompulsives 


. A7 .50 
High-school grades 31 61 
Mathematics test 8 .36 
Interest in engineering —.04 = 


Interest in selling real estate 
The evidence thus supports the hunch that for noncompulsive students interest
in the subject matter is an important factor in success, but that for compulsive
students interests make no difference. If this relation were verified in
further work, the correct selection procedure would be to divide men into
two groups. For the compulsive group, a composite ability score would be
used; for the noncompulsives, a different combining formula considering
interests would be needed.


Sequential Strategies 


In the multiple-cutoff or composite-score plan, all the tests are given at
once and the decision is then made. For many personnel decisions, a more
economical procedure is the sequential plan, in which testing is carried out in
several stages. After each stage, a decision is made to reject some men, to
accept others, and to continue testing those close to the borderline.
The decisions in the first stage of a sequential plan are based on a short
and incomplete test, and the method therefore makes somewhat less accurate
decisions than a plan in which every man takes all tests. The plan reduces
costs, however, because relatively few men take the second and later sets of
tests. Especially when the later stages include expensive apparatus tests or
interviews, the saving can be considerable. Sequential methods are generally
superior to single-stage testing, since great savings in testing cost can be
accomplished with a very small reduction in the correctness of decisions
(Arbous and Sichel, 1952; Cronbach and Gleser, 1957, pp. 48-68, 32-87).
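A two-stage version of the sequential plan can be sketched as follows. The cutting points, the scores, and the rule of simply adding the two stage scores are all hypothetical; the sketch is meant only to show how the second, more expensive stage is reserved for borderline cases.

```python
def sequential_decision(first_score, second_stage_test,
                        reject_below, accept_at_or_above, final_cutoff):
    """Return (decision, number_of_stages_used).

    second_stage_test is a callable that is invoked only for borderline
    cases, so its cost is incurred for relatively few applicants.
    """
    if first_score < reject_below:
        return "reject", 1
    if first_score >= accept_at_or_above:
        return "accept", 1
    # Borderline: continue testing and decide on the combined evidence.
    total = first_score + second_stage_test()
    return ("accept" if total >= final_cutoff else "reject"), 2


# Hypothetical applicants: (first-stage score, second-stage score if taken)
applicants = {
    "clear reject": (30, 50),
    "clear accept": (80, 20),
    "borderline": (55, 60),
}

results = {
    name: sequential_decision(first, lambda s=second: s,
                              reject_below=40, accept_at_or_above=70,
                              final_cutoff=110)
    for name, (first, second) in applicants.items()
}
stage_two_count = sum(1 for _, stages in results.values() if stages == 2)

print(results, stage_two_count)
```

Widening the borderline band sends more men to the second stage, raising cost and accuracy together; narrowing it does the reverse.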


23. What practical considerations determine whether a sequential plan should be
used in hiring workers for training?


Nonstatistical Combination of Scores 


All the procedures discussed so far involve an experimental determination
of the best possible rule for selection, and rigorous application of the rule.
The assumption underlying these approaches is that a statistical rule will
more often point to the correct decision than will a procedure which relies
on the judgment of a psychologist. The psychologist with a clinical orientation
often complains that such a mechanical method is unresponsive to the
unique characteristics of the particular case and cannot possibly be as wise
as a psychologist.


It seems that 
on the data, sh 


o the 


hant- 
ep D P . A mec 
ess in the training to which he was assigned 


an 
: edge 9, 
combining two tests (Electrical Knowledge and Arithmetic Reasoning)
correlated .50 with success in training of electrician's mates. The interviewer's
rating, based on these tests plus judgment, correlated only .41 with success.
In other words, judgments departing from the statistical formula reduced the
correctness of prediction (Conrad and Satter, 1945). P. E. Meehl (1955) made
a major comparison of "clinical vs. statistical prediction" in which he examined
every study where predictions made by judges could be compared with
predictions made from the same data by statistical formula. In some twenty
studies where such a comparison could be made, he found that the actuarial,
cookbook prediction was equal or superior to the judgmental prediction in
every case save one. The statistical method, which is obviously cheaper to
apply, beats the judge time after time, whether the judge be a counselor, a
clinical psychologist, or an industrial personnel manager.
Why does the judge do so poorly? The foremost reason is that he combines the
data by means of an intuitive weighting which he has not checked. The
statistical formula, on the other hand, has been carefully checked on a sample
of cases like the one for whom the new prediction is made. It uses the best
possible set of weights. The judge can beat the formula only by bringing in
additional data and combining those data in the proper manner with the facts
used by the formula.

It is very difficult for a judge to function efficiently. In the first place,
he does not know what weights he uses to arrive at a decision. He looks at the
man from various angles and finally comes to an intuitive decision. Almost
certainly, he gives greater weight to some factors than they deserve, and
changes his weights from one case to the next. Moreover, his judgment is
unreliable, in the sense that he might judge the same case differently on
different days. The formula never varies. There is some reason to think that
judges give too much weight to the additional facts they add to the test data.

Judges make many constant errors. They have stereotypes and prejudices; for
example, they make different predictions for women from what they would for men
with similar scores even when there is no evidence that men and women perform
differently on the job. In one of the most interesting studies, counselors
judged badly because they applied a completely sound principle in a situation
where for some strange reason the principle was inappropriate. Sarbin (1948;
see also Cronbach, 1955b) asked counselors to predict grade averages of
students at Minnesota from their high-school records, ACE test scores, and a
whole dossier of information on interests, experiences, and motivations. The
statistical formula combining ACE and high-school rank had a validity of .45
for men, .70 for women. The counselors did a little worse: .35 for men, .69 for
women. When we examine the weights used, we find that the formula placed almost
its entire weight on high-school rank and paid no attention to ACE scores
because in the three preceding classes at Minnesota the ACE had made no
independent contribution to prediction. The counselors, however, gave about
equal weight to high-school marks and to ACE scores. Such a weighting had been
found best in most of the reported investigations of college success, and is
quite consistent with the experience of colleges generally. For some reason,
the Minnesota student body during the period of this study was unique; quite
reasonable judgments by the counselors were the wrong ones for this situation.
The statistical formula was custom-made for the Minnesota situation and of
course it did better than the counselors working from general psychological
lore.
What does this imply? It implies that counselors, personnel managers, and
clinical psychologists should use formal statistical procedures wherever
possible to find the best combining formula and the true expectancies for their
own situation. They should then be extremely cautious in departing from the
recommendations arrived at on the basis of the statistics, unless they are
strongly convinced that the additional information they bring in is a sound
basis for decision. When they do use their own judgment, they should make
careful follow-up studies, comparing their number of hits with the number of
hits the formula would have yielded. Moreover, those who make judgments must
try to formulate the bases they use, stating just what they take into account
and how heavily they weight each bit of information.
The proper province of the statistical formula is the institutional decision,
where a definite and irreversible decision is required to carry out the
purposes of the institution. The admissions officer having to choose the most
promising applicants for a limited number of openings should certainly make
decisions in whatever way will be most accurate. In counseling, on the other
hand, decisions are personal decisions of the client and cannot be dictated by
the experience table. The counselor's responsibility is to help the client
understand himself; if, having the facts before him, he wishes to embark on a
course in which failure is likely, it is his right to use his life that way.


INTERPRETING SELECTION STUDIES 


What Is an Acceptable Validity Coefficient?

As validity coefficients for various tests have been presented in past
chapters, the reader probably has been classifying them mentally as "good" or
"poor." Many tests, particularly those of special abilities, do not seem
satisfactory at first glance. But, in one sense, a test has a satisfying
validity coefficient if it is better than other tests for the same purpose. The
only sound basis for evaluating a validity coefficient is the question: Does
the test permit us to make a better judgment than we could make without it? Is
it sufficiently better to justify its cost? Psychologists formerly placed
little value on tests with moderate validity coefficients. A so-called
"coefficient of forecasting efficiency" was computed which purported to tell
how much better a prediction from the test was than a random guess. According
to this coefficient, validity had to reach .86 before a test was "50 percent
better than chance." Tests with validity below .50 were thought to have
negligible practical value. This line of reasoning, we now know, is based on
inappropriate assumptions (Cronbach and Gleser, 1957). The reader is advised to
disregard the coefficient of forecasting efficiency and its implications, if he
encounters them in his reading.

Psychologists have abandoned their insistence on validity coefficients of .70
or .80 for all tests. While we would be pleased to reach these or better
levels, the experience of thirty years of practical testing shows that we
cannot often attain such standards. Coefficients as low as .30 are of definite
practical value (cf. Table 48). Occasionally, a test with much lower validity
is promising for further development, if it measures what no other test does.
In discussing this point, Strong comments that the test critic who is
contemptuous of low positive correlations is quite willing to accept
information of no greater dependability "when he plays golf or employs a
physician." The correlation of golf scores between the first and second
eighteen holes in championship play is, he says, about .30, and the reliability
of medical diagnosis near .40 (Strong, 1943, p. 55).

In his discussions with executives, the personnel psychologist would like to
state just how much benefit a selection program offers to a business, a school,
or a military force. He can give a partial answer by comparing selected and
unselected men with respect to number of failures in training, average length
of training required, rate of turnover, average production, and so on. All
these sources of evidence have been used, and all of them show that tests with
validities in the range from .30 to .50 make a considerable contribution to the
efficiency of the institution even though they make faulty judgments in many
individual cases.

The best single rule of thumb for interpreting validity coefficients is the one
developed by Brogden (1949). Making certain reasonable assumptions, he showed
that the benefit from a selection program increases in proportion to the
validity coefficient. Suppose the 40 applicants out of 100 who score highest on
a test are hired. We can consider the average production of randomly selected
men as a baseline. An ideal test would pick the forty men who later earn the
highest criterion score; the average production of these men is the maximum
that any selection plan could yield. A test with validity .50, then, will yield
an average production halfway between the base level and the ideal. To be
concrete, suppose the average, randomly selected worker assembles 400 gadgets
per day, and the perfectly selected group of workers turns out 600. Then a test
with validity .50 will choose a group whose average production is 500 gadgets,
and a test of validity .20 will select workers with an average production of
440 gadgets. The assumptions underlying Brogden's rule are these:


The job to be performed remains the same, whether men of high or low ability
are selected.

Production (or other measure of benefit) has a linear relation to test score.
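
Brogden's rule amounts to a straight-line interpolation between the baseline
and the ideal. The short Python sketch below merely repeats the gadget
arithmetic given above; the function name `expected_production` is ours,
introduced only for this illustration.

```python
def expected_production(baseline, ideal, validity):
    """Average criterion score of the selected group under Brogden's rule.

    baseline: average production of randomly selected workers
    ideal:    average production of a perfectly selected group
    validity: correlation between test and criterion, from 0 to 1
    """
    if not 0.0 <= validity <= 1.0:
        raise ValueError("validity must lie between 0 and 1")
    # The gain over the baseline is proportional to the validity coefficient.
    return baseline + validity * (ideal - baseline)


if __name__ == "__main__":
    # Gadget illustration: baseline 400 per day, ideal selection 600 per day.
    for r in (0.20, 0.50, 1.00):
        print(f"validity {r:.2f}: {expected_production(400, 600, r):.0f} gadgets")
```

The rule holds only so far as the two assumptions listed above hold; if benefit
is not linearly related to test score, the interpolation is no longer exact.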


The benefit derived from a selection plan depends on the selection ratio as
well as on the validity of the test. The selection ratio is the proportion of
persons tested who are accepted. If there is a large labor supply, the
selection ratio can be very low, but when applicants are scarce the selection
ratio may be forced up toward 1.00. Even an ideal selection plan has no effect
on the quality of workers when every applicant must be hired. If one can pick
and choose, average output can be much improved. Figure 63 shows the relation
of production to selection ratio and test validity for the gadget assemblers
used in the illustration above. In this figure it is assumed that among
unselected workers the average production is 400 gadgets, and the standard
deviation is 100. Tests of low validity have considerable value when the
selection ratio can be very low, when individual differences in job performance
are large, and when small increases in production have a large dollar value.
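
The joint effect of validity and selection ratio can be sketched numerically.
The Python fragment below is an illustration, not a reproduction of Figure 63:
it assumes, in addition to the linear relation already stated, that test scores
are normally distributed and that the highest scorers are accepted. Under those
assumptions the mean production of the selected group is the unselected mean
plus validity times the standard deviation times the average standard score of
those accepted. The function name and its defaults (mean 400, SD 100) are ours.

```python
from statistics import NormalDist


def mean_of_selected(selection_ratio, validity, mean=400.0, sd=100.0):
    """Expected average production of the accepted group.

    Assumes normally distributed test scores, top-down selection, and a
    linear relation between test score and production.
    """
    if not 0.0 < selection_ratio <= 1.0:
        raise ValueError("selection ratio must be in (0, 1]")
    if selection_ratio == 1.0:
        return mean  # everyone is hired, so selection changes nothing
    std_normal = NormalDist()
    # Standard-score cutoff above which the chosen proportion falls.
    cutoff = std_normal.inv_cdf(1.0 - selection_ratio)
    # Average standard score on the test among those accepted.
    mean_test_z = std_normal.pdf(cutoff) / selection_ratio
    return mean + validity * sd * mean_test_z


if __name__ == "__main__":
    for ratio in (0.20, 0.40, 0.60, 0.80, 1.00):
        row = [round(mean_of_selected(ratio, r)) for r in (0.20, 0.50, 1.00)]
        print(ratio, row)
```

Running the loop shows the pattern described in the text: at a selection ratio
of 1.00 no test helps, and the advantage of any given validity grows as the
selection ratio falls.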


[Figure 63: graph of Average Production of Selected Men (vertical axis)
against Selection Ratio (horizontal axis).]

FIG. 63. Benefit from a selection program as a function of validity and
selection ratio, under Brogden's assumptions.


In evaluating a validity coefficient to decide whether a test is worth using
for selection, one must ask the following questions:*

* The questions are so worded that an answer of "no" indicates that tests of
relatively low validity are likely to be helpful.


Are individual differences in job performance or other outcomes fairly 
small? 

Can we afford to discharge or transfer to other duties men who prove to
be unsuccessful? I.e., can we tolerate "misses"?

Is it important to hire every applicant who will be satisfactory, even
though this also involves hiring many men who will fail? I.e., must we avoid
"false positives"?

Does this test measure an ability which is already fairly well measured by
other tests or procedures already in use?

Is it possible to modify the job so that it makes less demand on the aptitude
tested?

Is the validity coefficient much lower than the reliability coefficient of the 
test? (If not, lengthening the test should raise validity.) 

Is administration of the test difficult and costly? 


24. In the light of the foregoing questions, how satisfactory is Deemer's validity 
coefficient of about .65 for selecting shorthand students? 

25. In one pilot-selection study, the predictive validity of pencil-and-paper
tests was .64 (elimination-graduation criterion). The coefficient was raised to
.69 when apparatus tests were added. Is such a small increase worth while, in
view of the questions listed above?

26. State employment offices use tests to guide workers into appropriate positions. 
A very low selection ratio may be used, since a particular unemployed worker 
may be directed into any one of hundreds of job families. In a particular in- 
surance agency, on the other hand, it is necessary to employ about 60 percent 
of those who apply for clerical jobs. Are the same tests equally suitable in both 
situations? 

27. In which of these situations is there likely to be a fixed number of vacancies, 
and in which can the decision maker set the critical score as high or low as he 
likes? 

a. A parole board decides which prisoners may be released.
b. An engineering school admits well-qualified applicants.
c. A school psychologist identifies mentally handicapped children to be placed
under a special teacher.
d. A college counseling bureau identifies clients likely to profit from
psychotherapy.

Restriction of Range. Tests predict less accurately when they are applied to a
homogeneous group. Validity coefficients rise when a test is applied to a group
with a wide range of ability, and drop when the test is used on a restricted,
preselected group. Many studies are based on selected groups. Deemer, for
example, did not test how well his instrument predicted shorthand learning of
all girls. Instead, it was tried on girls already planning to take the course.
Many girls of low aptitude were not included, since normally those entering a
shorthand course have successfully completed some work in typing. If Deemer's
test were applied to an entirely unscreened group, a higher coefficient would
result.
The effect of screening upon validity coefficients is shown in the Air Force
study referred to earlier. The validity coefficient for pilot selection was in
the neighborhood of .37 for men selected for flight training. When, for
experimental purposes, an unscreened group was admitted to training, the
validity coefficient in this unrestricted group rose to .66 (DuBois, 1947,
pp. 103, 93).

Investigators are frequently perplexed when a variable considered important in
the job analysis fails to predict the criterion of success. The job analysis
may have been correct in listing the ability as essential to the job, but
preselection may have reduced its significance as a predictor. If future
applicants will be drawn from a similarly selected group, this variable will
not help in prediction. But if the tests are applied to an unselected group,
the variable which had no predictive value in the restricted group may turn out
to be a good predictor. For example, intelligence tests have consistently been
poor predictors of success in teaching. The explanation is obvious: Nearly
every teacher has survived years of schooling with at least adequate marks,
which assures a fair to superior degree of intelligence (Figure 64). Among
those so selected, differences in tested intelligence play little part in
determining success as teachers. Granted that an intelligence test will not
help a school system hire teachers, an intelligence test is still a major
factor in advising a girl in high school whether she is likely to be able to
complete a teacher-training course. Failure to recognize the effects of
restricted range sometimes leads to discarding useful tests. In 1930, Moss
developed a test for selecting medical students which in tryout studies had
good correlations with grades. When schools selected students on the basis of
scores on the Moss test, they began to discover that the predictive
coefficients were quite low. Ultimately, in 1946, the Moss committee was
discharged and the test was abandoned. Then, when the test was no longer used
to select students, so that scores again covered the full range, research
studies began to report higher coefficients again.
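
The size of the effect can be sketched with the standard correction for direct
restriction on the predictor (often called Thorndike's Case 2). That formula is
not given in this chapter; we bring it in as an illustration, and it assumes a
linear relation with equal scatter about the regression line in both groups.
The function name `unrestricted_validity` is ours.

```python
from math import sqrt


def unrestricted_validity(restricted_r, sd_ratio):
    """Estimate validity in the unselected group from a restricted group.

    restricted_r: validity observed in the preselected group
    sd_ratio:     SD of test scores in the unselected group divided by
                  the SD of test scores in the preselected group
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    numerator = restricted_r * sd_ratio
    denominator = sqrt(1.0 - restricted_r ** 2 + (restricted_r * sd_ratio) ** 2)
    return numerator / denominator


if __name__ == "__main__":
    for ratio in (1.0, 1.5, 2.2):
        print(ratio, round(unrestricted_validity(0.37, ratio), 2))
```

By this formula, an SD ratio of about 2.2 would carry a restricted coefficient
of .37 to roughly .66. That ratio is our own back-calculation, offered only to
show the order of magnitude; it is not a figure reported in the Air Force study.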


28. If one were considering the probable success in industrial jobs of
graduates from an engineering school, what characteristics would have a
restricted range owing to preselection? What characteristics would probably not
have been restricted?


Contamination of Criteria. It is important to guard against contamination of
criteria, which spuriously raises correlations. Wherever ratings are used as
criteria, there is a possibility that teachers, foremen, or other raters will
be influenced by knowledge of the prediction data. Teachers may be influenced
in their grading by knowledge of a pupil's IQ. A foreman may rate a man higher
than his performance warrants because he knows the man has considerable
experience. These influences raise the correlation between grades and
intelligence, or rating and experience. The only way to eliminate contamination
is to keep predictor data secret until all criterion scores have been
collected.

[Figure 64: scatter diagram of Possible Success in Teaching (vertical axis)
against Intelligence (horizontal axis), showing All Adolescents Who Might Want
to Teach and the subgroup of Adolescents Reaching a Teaching Position.]

FIG. 64. Hypothetical data illustrating the effect of preselection upon
correlation. Dots show scores to be expected if every ninth-grader interested
in teaching later enters the profession. Circles show scores of persons likely
to survive gradual elimination as a consequence of low school marks.


29. In each of the following situations, trace how contamination might occur,
and suggest an improved procedure to avoid it.

a. A psychologist administers aptitude tests to entering college freshmen and
from the results predicts each student's success. Success is determined after
two years by noting which students have been dropped from school by the
school guidance committee for unsatisfactory work. The predictions are kept
in a locked file and not made available until the two years have passed.
The psychologist is a member of the committee but does not disclose the
predictions.

b. Test data on pupils' intelligence, mathematical ability, and other facts are
made available to science teachers so that they can do better teaching.
Learning in science is judged not by ratings but by an objective test of
ability in science given at the end of the course. The pretests are correlated
with this final score.

c. Tests for selecting salesmen are being tried experimentally. Because they
are thought to be valid, the results are given to the sales manager for his 
guidance in assigning territories to the salesmen in the experimental group. 
After a year of trial, each man is judged by the amount of his sales in
relation to the normal amount for his territory.

d. Flight instructor's ratings are used as a basis for promoting men from
primary to advanced training. It is desired to check the validity of these ratings
as predictors of success in advanced training. Advanced training is taken at 
the same field, with a different instructor. This man's judgment supplies the 
criterion. 


Criterion Unreliability and Bias. The size of a validity coefficient is limited
by the reliability of the criterion. A low validity coefficient may be the
result of poor criterion measurement rather than poor prediction. Grades and
ratings are particularly likely to be unreliable, whereas objective measures of
achievement can be made very accurate.
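
How strongly an unreliable criterion limits validity can be sketched with the
classical attenuation relation, in which the obtained coefficient equals the
coefficient against a perfectly reliable criterion multiplied by the square
root of the criterion reliability. The relation is a standard one that we
introduce here for illustration; the function names and the figures in the
example (.60 and .49) are hypothetical.

```python
from math import sqrt


def _check_reliability(reliability):
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")


def observed_validity(validity_with_perfect_criterion, criterion_reliability):
    """Validity expected when the criterion is measured unreliably."""
    _check_reliability(criterion_reliability)
    return validity_with_perfect_criterion * sqrt(criterion_reliability)


def corrected_validity(observed_r, criterion_reliability):
    """Estimate of validity against a perfectly reliable criterion."""
    _check_reliability(criterion_reliability)
    return observed_r / sqrt(criterion_reliability)


if __name__ == "__main__":
    # Hypothetical test: .60 against a flawless criterion, but the ratings
    # actually used as the criterion have a reliability of only .49.
    print(round(observed_validity(0.60, 0.49), 2))
```

The correction should be read as an estimate of what better criterion
measurement might show, not as a validity the test has already demonstrated.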

In many studies improvement in validity coefficients is obtained by refining
the criterion rather than by continued development of the predictors. An
example of the effects of better criteria is shown in Figure 65. When grades
for Navy classes in ship's engine operation were based only on instructor's
judgments, school grades had rather small correlations with predictor tests.
But when two highly valid achievement tests were used in allotting school
grades, the classification tests were much better predictors. It is to be noted
that the subjective grades were influenced most by academic and intellectual
abilities. When a valid measure of job knowledge and skill was applied as a
criterion, the prediction rested most heavily on Mechanical Knowledge and
Mechanical Aptitude. The tests that predict a valid criterion may be different
from those that predict a biased, incomplete criterion.

[Figure 65: Correlation with Final Grades (corrected for restricted range)
plotted for the tests General Classification, Reading, Arithmetic, Mechanical
Aptitude, Mechanical Knowledge (Mechanical), and Mechanical Knowledge
(Electrical). Three curves: 9 classes, grades based in part on performance
tests; 7 classes, grades based in part on practical test of equipment
identification; 6 classes graded without standard achievement tests.]

FIG. 65. Correlations of Navy classification tests with grades in Basic
Engineering School before and after introduction of standard achievement tests
(Stuit, 1947, p. 307).


Necessity for Confirmation of Findings 


When an investigator has once obtained a satisfactory validity coefficient, he
tends to install his program and stop research. Other workers, reading his
report of the study, may accept his test as valid and put it to work in their
own situations. This practice is unsound. In the first place, any validation
result is influenced by chance, and correlations will fluctuate from sample to
sample. Consequently the test which proves best in one sample may prove not to
be the best predictor in another similar sample. Even when the results are
based on a large sample, the particular critical score or the particular
weights most effective in a multiple correlation are certain to change when a
new group is tested. If the same formula is applied in other groups, the
correlation is sure to drop. Moreover, the supply of men and the conditions of
training change from time to time. It follows that the investigator must
redetermine the validity of his prediction technique periodically.

The weights for a composite score, or the critical scores in a multiple-cutoff
procedure, are determined so as to get the best possible prediction in the
sample studied. In the next sample, the same formula will have lower validity.
We speak of this as the "shrinkage" of validity. Shrinkage is likely to be
great when many possible predictors are tried and when weights are determined
from small samples. Shrinkage is relatively small when the predictors are
chosen initially on the basis of substantial past experience and theory, and
relatively large in a "shotgun" study where miscellaneous predictors are tried
with no particular rationale.

To estimate properly the validity of any scoring formula one must cross-validate
by trying the formula on a sample not used in selecting tests and establishing
scoring weights. Sometimes the validity remains nearly the same in the second
sample, but sometimes there is considerable shrinkage.

The term cross-validation usually indicates a second study in the same factory
or school where the prediction formula was developed. The general psychological
reader wants to know how well the formula holds up in other situations. This is
the question of validity generalization. We have seen several examples of the
fact that formulas cannot be transferred automatically to new situations. DAT
scores had different validities in different schools.
The value of a formula for selecting sewing-machine operators shifted when
the job emphasis shifted from speed to quality. Minnesota counselors went 
wrong when they gave the same weight to ACE scores as had been used at 
other colleges. No matter how well a selection procedure is validated and 
cross-validated in the original situation, it must be validated anew when it 
is carried into a new situation. Published results help only by suggesting what 
tests should be included in the tryout battery. 
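
Shrinkage in a "shotgun" study can be demonstrated by simulation. The Python
sketch below is a hypothetical illustration, not a reanalysis of any study
cited here: thirty simulated tests, each with a true validity of .20, are tried
on 40 cases; the test with the highest obtained coefficient is kept and then
tried on a fresh sample of 40. All names, sample sizes, and the seed are our
own choices.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def draw_sample(rng, n_cases, n_predictors, true_validity):
    """Simulate a criterion and predictors that all share one true validity."""
    criterion = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
    noise_weight = sqrt(1.0 - true_validity ** 2)
    predictors = [
        [true_validity * y + noise_weight * rng.gauss(0.0, 1.0) for y in criterion]
        for _ in range(n_predictors)
    ]
    return predictors, criterion


def one_study(rng, n_cases=40, n_predictors=30, true_validity=0.20):
    """Return (validity in the original sample, validity on cross-validation)."""
    predictors, criterion = draw_sample(rng, n_cases, n_predictors, true_validity)
    original_r = max(pearson(p, criterion) for p in predictors)
    # Every simulated test has the same true validity, so a single newly drawn
    # predictor stands in for the chosen test applied to a fresh sample.
    new_predictors, new_criterion = draw_sample(rng, n_cases, 1, true_validity)
    return original_r, pearson(new_predictors[0], new_criterion)


def average_shrinkage(n_studies=200, seed=0):
    """Average the two coefficients over many simulated studies."""
    rng = random.Random(seed)
    results = [one_study(rng) for _ in range(n_studies)]
    mean_original = sum(r[0] for r in results) / n_studies
    mean_cross = sum(r[1] for r in results) / n_studies
    return mean_original, mean_cross


if __name__ == "__main__":
    original, cross = average_shrinkage()
    print(f"best test, original sample: {original:.2f}")
    print(f"same test, new sample:      {cross:.2f}")
```

On the average the coefficient of the "best" test in the original sample runs
well above its true validity, while the cross-validation coefficient falls back
toward .20; the gap is the shrinkage, and it widens as more predictors are
tried or as the sample is made smaller.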


30. In a configural scoring formula, weights may be assigned to every variable
and to every pair of variables. How many scores are weighted (or considered
for possible weighting) when a configural method is applied to a set of ten
subtests? What does this imply about the shrinkage of configural validities?


CLASSIFICATION DECISIONS 


The employment manager and the college admissions officer make genuine
selection decisions. They hire or admit some applicants and have nothing
further to do with those they reject. Classification decisions are far more
numerous than selection decisions, and many so-called selection programs really
lead to classification decisions. A classification decision is one in which
persons are assigned to different jobs, courses, therapeutic treatments, etc.
The task in classification is to assign each person to the job that he can do
best, subject to limitations imposed by the number of vacancies in each job
category. The decision maker is concerned about the subsequent performance of
everyone, rather than just the persons assigned to one treatment. The Air Force
program for "pilot selection" is really a classification program, because men
who do not pass the tests are retained in the service and assigned to other
duty.

The theory of classification testing must probe into the same questions as the
theory of selection. There are methods of combining scores for classification
purposes, strategies for assigning persons to fill quotas, and so on. The
methods differ quite a bit from those appropriate in simple selection. We shall
not attempt to summarize these methods and the related theoretical principles,
except to comment on the relation between test validity and classification
efficiency.*

A test which predicts success within many jobs is a poor instrument for
classification because it does not tell which job the person can do best. An
ideal classification test is one which has a positive correlation with
performance in one job and a zero (or, better yet, negative) correlation with
performance in other jobs. A general mental test is of little value for
deciding which curriculum a college student should enter, even though it
correctly indicates that he will do well in academic work.

* For a summary of much of the theory, and further references, see Cronbach and
Gleser, 1957, esp. chaps. 6, 9.
When we apply Brogden's assumptions to classification, we find that the
value of a test used to assign persons to one of two treatments is proportional
to its differential validity. Differential validity is expressed by the formula

    $s_1 r_{1t} - s_2 r_{2t}$

Here $s_1$ and $s_2$ are the standard deviations of criterion scores for the
two treatments for randomly selected men, the two criteria being expressed in
comparable units such as dollar value of the worker's production. $r_{1t}$ and
$r_{2t}$ are the usual predictive validity coefficients for test t (Brogden,
1951).

Looking back at Table 48, let us assume that the criterion standard deviations
for pilots and navigators are about equal; that is, that the difference in
value to the Air Force between an ace pilot and a borderline pilot is equal to
that between an outstanding and a mediocre navigator. We see that the Two-Hand
Coordination Test has a validity of .26 for navigator and .30 for pilot. It
therefore has practically no differential validity. Numerical Operations has a
validity of .26 for navigator and .01 for pilot. It is therefore a good
classification test. The Mathematics test, with validities .50 and .08, is even
better.
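
The comparison just made can be set out as a short computation. The Python
sketch below applies the differential-validity formula to the three pairs of
coefficients quoted in the text, taking the two criterion standard deviations
as equal (set to 1.0 for convenience). The function and variable names are
ours.

```python
def differential_validity(sd_first, r_first, sd_second, r_second):
    """Differential validity of one test for two treatments.

    sd_first, sd_second: criterion standard deviations in comparable units
    r_first, r_second:   predictive validities of the test for each treatment
    """
    return sd_first * r_first - sd_second * r_second


# Validities quoted in the text, given as (navigator, pilot).
QUOTED_VALIDITIES = {
    "Two-Hand Coordination": (0.26, 0.30),
    "Numerical Operations": (0.26, 0.01),
    "Mathematics": (0.50, 0.08),
}


def rank_tests(validities, sd_first=1.0, sd_second=1.0):
    """Order tests from most to least useful for classification."""
    scored = {
        name: differential_validity(sd_first, r_first, sd_second, r_second)
        for name, (r_first, r_second) in validities.items()
    }
    # A large negative value is as useful as a large positive one: it simply
    # points toward the other assignment. Hence the ranking by absolute size.
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    for name, value in rank_tests(QUOTED_VALIDITIES):
        print(f"{name}: {value:+.2f}")
```

Changing `sd_first` or `sd_second` shows how the ranking may shift when one job
is judged more important than the other.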

One of the remarkable values of differential predictors is that they make much
better use of a pool of manpower than can a general predictor. Suppose we have
three tests, A, B, and C. Test A is a general test which has validity .40 for
job 1 and job 2. If we want to rule out below-average performers, we accept the
best 50 percent of the men. We must divide them randomly between jobs because
test A has no differential validity. Test B has validity .40 for job 1, .00 for
job 2, and zero correlation with test C. Test C has validity .00 for job 1, and
.40 for job 2. Now we can accept all men above average on either B or C and
assign each one according to which test he does better on. With these
differential tests, 75 percent of the men can be used, yet each one is average
or better in aptitude for the job in which he is placed.

[Figure 66: two panels, Single Predictor and Differential Predictors, marking
the region of men Accepted; the differential panel plots Test B against the
predictor for job 2.]

FIG. 66. Superior use of manpower by means of differential predictors.
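
The figure of 75 percent follows from simple probability. A man is rejected
only if he falls below average on both B and C; since the two tests are
uncorrelated, that happens to about one man in four. The Python sketch below
makes the arithmetic explicit. It treats the uncorrelated tests as independent,
which is our added assumption, and the function name is ours.

```python
def fraction_usable(n_independent_tests, fraction_above_cutoff=0.5):
    """Share of men who clear the cutoff on at least one of several tests.

    Assumes the tests are statistically independent of one another.
    """
    if n_independent_tests < 1:
        raise ValueError("at least one test is required")
    fraction_failing_all = (1.0 - fraction_above_cutoff) ** n_independent_tests
    return 1.0 - fraction_failing_all


if __name__ == "__main__":
    print(fraction_usable(1))  # general test A alone
    print(fraction_usable(2))  # differential tests B and C together
```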
Most decisions in clinical diagnosis are essentially classification decisions,
i.e., choices between treatments. Even discharging a patient implies that a
return to the community is the most beneficial treatment for him. There are
usually no quotas to be filled in clinical diagnosis; every one may be called
"normal" or every one called "schizophrenic" if such uniform classification
appears correct. Meehl and Rosen (1955) have drawn attention to the fact that
such uniform classification is often the best strategy even when we are using a
test which has significant positive validity. This is true when the clinician
is trying to identify a rare condition; Meehl and Rosen use the problem of
predicting suicide as an example. If a test identifies a person with a high
probability of suicide, the clinician will probably recommend that he be given
closer attention and more intensive treatment than the probable nonsuicide.
Suicide is rare; perhaps 5 percent of those tested in a certain clinic will
later attempt suicide. A person with a low test score (on a hypothetical test)
may have expectancy of only 0.1 percent of a suicide attempt, and one can
confidently place him in the nonsuicide category. With higher test scores,
probability of suicide increases, so that the test has undoubted validity. The
highest score in the clinic sample, however, may indicate only a 20 percent
expectancy of suicide. These people cannot be called probable suicides. If we
so diagnosed them, we would make four errors (false positives) for every
correct decision. The special care appropriate for a probable suicide is a
great drain on the resources of a clinic. It may not be profitable to invest
this effort in four false positives in order to prevent one suicide. To be
sure, the clinic may argue that one person saved far outweighs the cost of
guarding all five, in which case persons with the highest test score will be
placed in the risk-of-suicide category. The principle still holds: a
classification test with a positive validity is not worth using if the cost of
false positives outweighs the benefit from hits.
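
The weighing of hits against false positives can be put in numerical form. The
Python sketch below is only an illustration of the principle: the expectancy of
.20 comes from the example above, but the benefit and cost values are
hypothetical figures in arbitrary units, and the function names are ours.

```python
def errors_per_hit(expectancy):
    """False positives incurred for each correct identification."""
    if not 0.0 < expectancy <= 1.0:
        raise ValueError("expectancy must be in (0, 1]")
    return (1.0 - expectancy) / expectancy


def net_benefit_of_flagging(expectancy, benefit_per_hit, cost_per_false_positive):
    """Expected net benefit per person placed in the high-risk category."""
    expected_gain = expectancy * benefit_per_hit
    expected_loss = (1.0 - expectancy) * cost_per_false_positive
    return expected_gain - expected_loss


if __name__ == "__main__":
    p = 0.20  # expectancy at the highest test score, as in the text
    print(errors_per_hit(p))
    # Hypothetical utilities: a hit valued at 3 or at 10 units, each false
    # positive costing 1 unit of the clinic's resources.
    for benefit in (3.0, 10.0):
        print(benefit, net_benefit_of_flagging(p, benefit, 1.0))
```

With the smaller benefit the net value is negative and uniform classification
is the better strategy; with the larger benefit the clinic's argument prevails.
The test is the same in both cases; only the values attached to outcomes differ.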


To show that a test is beneficial, it is necessary to estimate the goodness of
the decisions it leads to. A positive validity coefficient alone is not enough
to demonstrate practical usefulness for institutional decisions.

31. Suppose that pilot performance is judged to be three times as important as
navigator performance ($s_1 = 3s_2$). Then what tests have greatest
differential validity for these jobs?

32. What tests have greatest differential validity for distinguishing pilots
from bombardiers?

33. A twenty-point test for parole prediction gives these expectancies of
violating parole: for a score of 20, 40 percent; score 10, 20 percent; score 0,
[...] percent. [...] practically by a parole board, or should all prisoners
[...] obey probation rules?


Suggested Readings 

Flanagan, John C. The critical incident technique. Psychol. Bull., 1954, 51,
327-358.
    Procedures used to obtain and interpret critical incidents are described,
    together with suggestions for using the information in measurement and
    training.

Ghiselli, Edwin E., & Brown, Clarence W. Analysis of jobs. Personnel and
industrial psychology. (2nd ed.) New York: McGraw-Hill, 1955. Pp. 17-58.
    The authors survey and critically compare methods of job analysis which may
    be used in deciding what tests deserve tryout.

Kirchner, Wayne K., & Dunnette, Marvin D. Applying the weighted application
blank procedure to a variety of office jobs. J. appl. Psychol., 1957, 41,
206-208.
    A simple experiment shows how a score derived from personal history can
    predict job tenure.

Perrine, Marvyn W. The selection of drafting trainees. J. appl. Psychol., 1955,
39, 57-61.
    A compact report of a selection study on a small scale illustrates nearly
    all the principles and problems of selection research. This technically
    excellent investigation was done as an undergraduate honors project.

Tiffin, Joseph, & McCormick, Ernest J. General principles of personnel testing.
Industrial psychology. (4th ed.) Englewood Cliffs, N.J.: Prentice-Hall, 1958.
Pp. 75-109.
    Procedures used in validating tests for industrial selection are described.
    Particular emphasis is placed on the difference between studies on present
    employees and studies conducted on new applicants, tested at the time of
    hiring but not screened on the basis of test performance. The importance of
    the selection ratio as a factor determining the usefulness of a test is
    fully explained.

training in the various skills needed to maintain a complex fighting force.
When educators and psychologists were given responsibility for ie P ri 
schools during World War II, one major change they proposed was t imm d 
velopment of standard proficiency tests. The advantages the Navy fo 
may be summarized as follows (Stuit, 1947, pp. 287-354): —-— 
• Tests aided in holding instruction constant in all schools preparing men
for a given duty (e.g., torpedomen). Although objectives and curricula were
standardized, individual schools tended to neglect or overemphasize
particular topics. Since any neglect would lower test scores, the tests
forced teachers to do as the course planners intended.
• The tests provided a basis for revising the curriculum and improving
instruction. If results showed that certain skills were not mastered in the
time allotted, it was necessary to reconsider the length of the course or
the emphasis placed on these skills. Without such tests, instructors often
assumed that, having "covered" a topic, they had taught it. Test results
were of great interest to instructors and often caused them to ask
supervisors and specialists to suggest ways to improve their teaching.
• Proficiency scores identified classes which were not making adequate
progress, so that supervisors could investigate the cause.
• Some tests required the student to demonstrate job skills rather than
merely to give verbal answers. Such tests directed the attention of
instructors to the behaviors the course was intended to produce. Reliance on
the lecture-discussion method of teaching declined, and training improved.
• Tests which placed emphasis on all significant aspects of the course
made sure that all-round proficiency would be considered and promoted. In
the Basic Engineering school, before the introduction of standardized
testing, grades were strongly influenced by performance in the arithmetical
aspects of the course, probably because this ability was easy to test. The
new examinations stressed mechanical understanding. As a result, attention
was drawn to men who, despite skill in arithmetic, were poor in other
essentials.
• In the absence of tests, grades had been assigned to students on the
basis of subjective impressions, with, at best, the aid of teacher-made
tests. Such marks were unreliable and subject to bias. Even objective tests,
made by teachers without special training, measured poorly. Careful
preparation of tests led to more accurate grading. Analysis of grading
standards and tests reduced [...] marks were based solely on [...] places
represented the same degree of proficiency.
• More accurate final marks were a better criterion against which to
validate selection tests.
• Motivation of students and instructors was improved by rivalry based on
a fair standard. Showing a man his particular deficiencies was useful for
motivating and directing study.

In effect, the program of proficiency testing introduced into personnel
management "quality control" like that imposed on manufacturing processes.
Substandard individuals were thrown back for further polishing, or for
discard. Substandard teaching methods were detected and changed. These
advantages have their counterparts in industrial training and in schools and
colleges. Marks are unreliable, and may emphasize some aspects of the
course to the exclusion of others. Different instructors in the same
department grade differently and teach with different effectiveness.
Teachers depart from planned curricula. But stringent control by testing can
be undesirable in general education, for reasons to be discussed later in
this chapter.
One significant contribution of standardized tests has been to break down
the "time-serving" concept of education. A person's standing in school is
frequently judged by the number of years he has put in, or the number of
courses he has passed through. Time spent is no index of education
received. In one study, where thousands of college students took
standardized tests of knowledge in various fields, many college seniors knew
less than the average high-school senior. Since number of units accumulated
tells little about proficiency, tests are being given increasing weight as
evidence of educational development. In most communities, an adult who did
not complete high school may receive a diploma by passing an examination and
use this diploma to enter college if he wishes. Colleges exempt from certain
required courses those who perform well enough on proficiency tests. Control
by proficiency examination is widespread in professional education. Lawyers,
for example, must take a state examination before being admitted to
practice. Psychologists wishing a diploma certifying their competence as
clinical, industrial, or counseling psychologists take an examination given
by the American Board of Examiners in Professional Psychology (ABEPP).

In this chapter we can introduce only the major problems of proficiency
testing and illustrate a few techniques that have been used. The psychologist
and the teacher frequently have to construct proficiency tests. This is an art
which requires both experience and technical training. For advice on
construction of tests for various purposes, the reader should consult such
specialized books as these:
Dorothy C. Adkins and others, Construction and Analysis of Achievement Tests,
Washington, Government Printing Office, 1947. This discusses test construction
from the point of view of the Civil Service Commission.

E. F. Lindquist (ed.), Educational Measurement, Washington, American Council
on Education, 1951. This major handbook on test construction covers both
procedures and theory.

J. R. Gerberich, Specimen Objective Test Items, New York, Longmans, Green,
1956. This book discusses the use of objective tests in various fields and
provides over 200 examples of test items developed for special purposes. It
also has excellent bibliographies.


1. In the United States, a high-school diploma is awarded to almost any pupil
who has stayed in the school system for twelve years. In Great Britain, where
fewer persons complete secondary education, the evidence of "completion" is
the General Certificate of Education, granted not by the school but by a
regional examining body controlled by the universities. Only pupils who meet
the passing standard on a test are given the certificate. What assumptions
about society underlie each plan? What are the social consequences of each
plan?

2. What consequences would follow if ABEPP published the proportion of
applicants trained by each university who pass the examination for their
diploma in clinical psychology?


VALIDITY OF PROFICIENCY TESTS 


Content Validity 


Among the four types of validity introduced in Chapter 5, little has yet
been said about content validity. Content validation is primarily relevant in
proficiency testing. General and special ability tests, for the most part,
employ one type of content to assess ability to learn to deal with some other
content. The typical proficiency test, on the other hand, assesses ability to
deal with content of which the test is supposed to be a sample, and its
content validity must be established.

Whereas predictive and concurrent validation judge a test by statistical
study of results, content validity is established by logical examination of
the test and the methods used in its preparation. The question is, How well
does performance on the test serve as an index of performance on some
definite "universe of situations"? The test questions are only a sample of
all the possible questions that might be asked, and they may or may not be
representative of the total domain of appropriate questions.

Ideally the author of a test would define a universe to be measured and
then sample his items so as to represent that content. To specify the
universe, he has to define both the stimuli and the responses that concern
him. Consider first the stimuli. Each of the following describes a universe
of content in which some tester might be interested:

• All the flags used in the U.S. Navy signal system.
• All the words likely to be read in everyday German, i.e., in newspapers,
correspondence, etc.
• [...] digits or less.
• All facts regarding schizophrenia given in a certain textbook.
To specify the response he intends to observe, the test developer would
indicate what he desires the subject to do. Is the subject to name the flag,
taking as long as he needs? Or is he to recognize it rapidly? Is he to tell
what a German word means when he hears it? Or to recall the German word
when given the English equivalent?

When an author has defined a universe of content, he then can prepare
a sample to represent that universe. For instance, he might tabulate all the
words used in German newspapers and use this as his basic list. A random
sample could be taken, perhaps every 200th word on the list. Alternatively,
the sample could be chosen on a representative basis rather than randomly.
The words might be grouped according to frequency of use: the thousand most
frequent, the next thousand, and so on. Then twenty words might be taken at
random from each level. The resulting sample would have an average frequency
of use similar to that of the universe. The representative sample generally
gives a somewhat more accurate measure than the random sample of equal size.
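
The two sampling plans just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list is a hypothetical stand-in for a tabulated newspaper vocabulary ordered from most to least frequent, and the function names are invented for the example.

```python
import random

def systematic_sample(words, step=200):
    """Take every `step`-th word from the list (the 200th, 400th, ...)."""
    return words[step - 1::step]

def stratified_sample(words_by_frequency, band_size=1000, per_band=20, seed=0):
    """Draw `per_band` words at random from each successive frequency band."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = []
    for start in range(0, len(words_by_frequency), band_size):
        band = words_by_frequency[start:start + band_size]
        sample.extend(rng.sample(band, min(per_band, len(band))))
    return sample

# Hypothetical vocabulary of 10,000 entries, most frequent word first.
word_list = [f"wort{rank:05d}" for rank in range(1, 10001)]

every_200th = systematic_sample(word_list)
by_band = stratified_sample(word_list)
print(len(every_200th), len(by_band))
```

With these assumed sizes the first plan yields 50 words and the second 200, so a fair comparison of accuracy would first set the two samples to equal size, as the text specifies.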

Formal sampling plans are most used to select items for educational
tests. Spelling words, arithmetic combinations, shorthand symbols, and
other collections of factual associations can be catalogued and sampled. In
subjects like history and science the content cannot be reduced to a list of
specific items, but it is still possible to sample so as to represent each
section of the course in proper proportion.

Sampling is sometimes very poor in tests developed by an inexperienced
or untrained tester. A spelling test may consist of the words someone believes
workers should know, rather than words actually used on a job or actually
covered in a course of study. A test in physics may overemphasize items on
the parallelogram of forces if the tester finds such items easy to invent, and
may neglect topics where he lacks good ideas for items. Competent test
developers take great care to match their proficiency tests to a careful job
description or to the course of study.

If a test is prepared according to a clearly described sampling plan, a
prospective user can judge content validity very simply. He needs only to
decide whether he is satisfied with the author's choice of universe and his
sampling method. If a German teacher is interested in preparing his students
to read everyday German, he should be content with a test based on newspaper
vocabulary. If the course is intended to teach literary German or
scientific German, the same test would be less appropriate. A teacher of
such a course could adopt it only with some risk of drawing false
conclusions. The student who has the largest scientific vocabulary may not
earn an especially high score on the test of newspaper vocabulary.

Although sampling items from a definite universe is a pleasing and logical
ideal, very few of the situations that concern testers reduce to such simple
[...] to know which of four ways of teaching subtraction was best. Some
pupils were taught by the equal additions method and others by the
decomposition or borrowing method. Each method was presented meaningfully in
some classes and as a mechanical or rote procedure in other classes.
Subtraction of two-digit numbers (e.g., 27 from 41) was the content used in
instruction. On a test of accuracy in two-digit problems there was a small
advantage for the borrowing method, meaningfully taught, but the difference
between groups was so small that one might advise a teacher to use whichever
method he happened to prefer. Arguing that an important aim of beginning
instruction in subtraction is to pave the way for more complicated problems,
the investigators also tested the children on subtraction of three-digit
numbers (e.g., 358 from 644) even though this had not been taught. On this
test the borrowing-meaningful group did 50 percent better than the other
groups; this was unquestionably the best teaching procedure.

To take into account the variables that ought to be measured, one must
find out just what the objectives of instruction are. Objectives do not end
with factual knowledge and a limited group of skills. When one asks a
teacher why he is teaching his course, he lists a large number of aims.
The geometry teacher, for example, denies that the purpose of his course
is to transmit a certain number of theorems, a few practical principles, and
some skills with ruler and compass. Instead, he speaks of developing habits
of reasoning, skill in identifying assumptions, skepticism about [...]
conclusions, and so on. Yet achievement tests have traditionally emphasized
specific facts and skills to the exclusion of other important outcomes of the
course.

In a series of studies of college and high-school teaching (Smith and
Tyler, 1942), R. W. Tyler and his coworkers identified a large number of
purposes schools claimed to hold. These aims may be grouped as follows
(Raths, 1936):


Functional information. Not mere rote knowledge, but knowledge that
    can be applied to new situations where it is relevant.
Thinking skills and habits.
Attitudes and social sensitivity. (Tolerance, spirit of scientific inquiry,
    appreciation of music, etc.)
Interests, aims, purposes. (Vocational goals, secretarial interests, etc.)
Study skills and work habits.
Social and personal adjustment.
Creativeness.
Physical health.
A functional philosophy of life.


Before one can determine how well the student has developed these
qualities, it is necessary to define them in terms of behavior. How does a
person act who has "ability to draw sound conclusions from scientific data"?
Skill in interpretation of data is shown by certain definite actions. If we
give a skilled person a table or graph carrying unfamiliar information, he
does certain things. He identifies major trends. He disregards fluctuations
that are due to variations in sampling. He concludes that factor A and
factor B change together, not that factor A causes factor B. Having defined
precisely what actions give evidence that the subject possesses the desired
skill, it is a fairly simple matter to observe those actions and translate
the observations into a measurement.
Facts and skills loom so large in the usual classroom that teachers and test
designers have emphasized them out of proportion to other types of outcome.
Although it is important to measure knowledge and skill, a pupil may
earn a high score on memorized material and yet have made little progress
toward understanding the course. A course in cooking presumably is supposed
to improve ability to cook. But one teacher tested a college class on
knowledge of scientific principles underlying cookery, and also had them
cook food. The quality of cooking correlated only .25 with the verbal
knowledge (Arn[...], 19[...], p. 25).

TABLE 51. Improvement in Abilities in Zoology, Measured at the End of the
Course and One Year Later

                                              Mean Score            Percent of Gain
                                    Beginning   End of   One Year   During Course Which
Type of Examination Exercise        of Course   Course   Later      Was Later Lost

Naming animal structures pictured
  in diagrams                          22         62       31         7[...]
Identifying technical terms            20         83      [...]       [...]
Recalling information
  a. Structures performing functions
     in type forms                     13         39       34         21
  b. Other facts                       21        [...]    [...]       21
Applying principles to new
  situations                          [...]      [...]    [...]       [...]
[...]

SOURCE: R. W. Tyler, 1934, p. 76.

Studies of forgetting give weight to the argument that thinking and
attitudes should be measured. Facts poorly understood are quickly dropped
from the mind, whereas attitudes and changes of thinking habits are usually
much more lasting. Tyler gave a series of tests to one college class before
they studied zoology, at the end of the course, and again after a second
year in which they studied no zoology. The most lasting changes were in
ability to apply principles to new problems and to draw conclusions from
data (Table 51).
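
The last column of Table 51 can be illustrated with a small calculation. The sketch below assumes, from the column heading alone, that "percent of gain which was later lost" means the drop from end-of-course to one-year-later, divided by the gain from beginning to end of course; the function name is invented for the example, and published figures may differ slightly because they were presumably computed from unrounded means.

```python
def percent_of_gain_lost(beginning, end, year_later):
    """Share of the during-course gain that had disappeared a year later.

    A negative result would mean the score kept rising after the course.
    """
    gain = end - beginning
    if gain <= 0:
        raise ValueError("no gain during the course, so nothing to lose")
    return 100.0 * (end - year_later) / gain

# Means given in Table 51 for naming animal structures: 22, 62, 31.
print(percent_of_gain_lost(22, 62, 31))

# Means given for structures performing functions in type forms: 13, 39, 34.
print(round(percent_of_gain_lost(13, 39, 34), 1))
```

On the rounded means, roughly three-quarters of the gain in naming structures was lost within the year, against about a fifth of the gain in recalling which structures perform which functions.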


In stating objectives and in designing tests it is especially important to
distinguish ability (maximum level of performance) from typical behavior.
Knowing the right answer is no guarantee that a person will behave in the
right way. It is easy, for example, to prepare a true-false test for a course
in "how to study." After a few lectures on principles of study, most students
know how they should study and can pass the test. But the gap is wide
between what students know about study and what they do about it. Measures
of typical behavior are needed to evaluate the effectiveness of courses
teaching handwriting, leadership or personnel management, resistance to
propaganda, accuracy in arithmetic, and many other objectives. Proficiency
tests measure abilities produced on demand. To evaluate instruction fully, it
is necessary to supplement proficiency tests with observations and other
measures of typical behavior (see Part Three).


7. For one of the following courses, try to list all the important objectives:
    a. Study of literature in the junior high school.
    b. A course to train union officials for collective bargaining.
    c. A course to train junior executives in human relations.

8. Define each of the following objectives in terms of specific behaviors:
    a. "To train young people for wise parenthood."
    b. "To increase appreciation of good literature."
    c. "To prepare young people for the duties of citizenship."

9. The Brownell-Moser test of ability to do a performance that had not been
taught is a "transfer" test.
    a. Is it fair to judge the student's learning by asking what he has not
       studied?
    b. A man is trained to repair certain models of a radar set. What would
       be a suitable transfer test and what would be learned from it?
    c. It is claimed that study of French improves English. Would it be
       useful to include an English test in a research study on a new method
       of teaching French?


Construct Validity 


The listing of various objectives implies a list of distinct kinds of
behavior. Are "thinking skills" really distinct from "functional
information"? This is a problem of construct validation.

Tyler's evidence that applying principles and interpreting experiments
are much less subject to forgetting than factual information indicates that
these abilities are distinct. In a more extensive study of fourteen [...]
courses he found that the correlations between different types of
proficiency were quite small. Even after c[...] the correlations were
(Judd et al., 1936):


Knowledge of facts vs. application of principles, about .45 
Knowledge of facts vs, inference from experiment, .35 
Application vs. inference, .40 
10. In a certain course, the correlation between a factual test and a test of
ability to apply knowledge to new situations is .40. Assume that grades are
assigned as follows: 10 percent A, 20 percent B, 40 percent C, 20 percent D,
and 10 percent F. What grades will A students on the first test receive if
the second test is used as a basis for grading? (Use the scatter diagram on
p. 114.)

Effects of Item Form on What Is Measured. Tests having the same "content"
may measure different abilities because of variables associated with
item form. Reading ability, for example, affects scores on almost all
achievement tests. A valid measure of knowledge is not obtained if a person
who knows a fact misses an item about it because of verbal difficulties. The
Navy Mechanical Knowledge Test contained four types of item: mechanical
facts, tested verbally; mechanical facts, tested pictorially; electrical
facts, tested verbally; and electrical facts, tested pictorially. Similarity
of content produced lower correlations than similarity in form (Table 52).
In other words,


TABLE 52. Correlations of Tests Having Similar Form and Tests Having Similar
Content

                                                             Correlation
                                                             Corrected for
                                               Correlation   Attenuation

Tests similar in form, different in content:
  Verbal tests: mechanical vs. electrical         [...]         .79
  Pictorial tests: mechanical vs. electrical      [...]         .86
Tests similar in content, different in form:
  Mechanical: verbal vs. pictorial                [...]         [...]
  Electrical: verbal vs. pictorial                 .51          .74
Tests different in both form and content:
  Mechanical verbal vs. electrical pictorial      [...]         .63
  Electrical verbal vs. mechanical pictorial      [...]         .59

Kuder-Richardson reliability coefficients:
  Mechanical verbal                                .89
  Mechanical pictorial                             .82
  Electrical verbal                                .71
  Electrical pictorial                             .7[...]

SOURCE: Conrad, 1944.


the form of the items largely determined the score received. Another study
[...] that the verbal element in tests may be un[...] had been validly
evaluated by scores [...] economical substitute, verbal and pictorial [...]
was tested in the two forms, [...] supplemented by words. Questions dealt
[...] crew, appearance of tracers when the gun was properly aimed, etc. The
pictorial test had a correlation of .90 with instructors' marks based on gun
operation whereas the validity of the verbal test was only .62. The verbal
test was in large measure a reading test; it correlated .59 with a Navy
reading test, while the picture test correlated only .26 with reading
(Training Aids Section, 1945).
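
Table 52 reports correlations "corrected" with the help of reliability coefficients. A minimal sketch of such a correction follows, assuming the usual Spearman formula in which the observed correlation is divided by the square root of the product of the two reliabilities. The function name is invented here, and the second reliability (.67) is an illustrative assumption rather than a figure taken from the table.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two measures if both were perfectly reliable."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed correlation .51 between two tests with reliabilities .71 and
# (assumed for illustration) .67.
corrected = correct_for_attenuation(0.51, 0.71, 0.67)
print(round(corrected, 2))  # 0.74
```

The correction always raises the coefficient when the reliabilities fall below 1.00, which is why comparisons among test pairs of unequal reliability are better made on the corrected values.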

Speed is relevant and important in tests of typing attainment or reading
facility, or in tests of arithmetic for cashiers. Speed is irrelevant when we
wish to know how large a pupil's vocabulary is, how much science he
knows, or how accurately he can reason. Speeding can usually be justified
in proficiency tests only if the test is intended to predict success in a task
where speed is helpful.

Many popular testing techniques are strongly affected by response styles.
A response style is a habit or momentary set which causes the subject to
earn a different score from the one he would earn if the same items were
presented in a different form. In true-false tests particularly, some people
have the habit of saying "true" when in doubt, while others are
characteristically suspicious and respond "false" when in doubt. If the
tester has included a large proportion of true statements in his test, the
acquiescent student will earn a high score even if his knowledge is limited.
Other response styles include tendency to gamble, working for speed rather
than accuracy, and use of a particular style in essay tests.
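
The effect of acquiescence on a true-false score can be made concrete with a little arithmetic. The sketch below rests on simplifying assumptions that are not in the text: the student answers correctly every item he knows, and on every other item says "true" with a fixed probability. The function name and the numbers (40 percent known, 70 percent of statements true) are hypothetical.

```python
def expected_true_false_score(known, prop_true_items, p_say_true_when_unsure):
    """Expected proportion correct on a true-false test under a simple guessing model."""
    unknown = 1.0 - known
    # A doubtful answer is right when "true" meets a true statement
    # or "false" meets a false statement.
    p_correct_guess = (p_say_true_when_unsure * prop_true_items
                       + (1.0 - p_say_true_when_unsure) * (1.0 - prop_true_items))
    return known + unknown * p_correct_guess

# Two students with the same knowledge but opposite response styles.
acquiescent = expected_true_false_score(0.40, 0.70, 1.0)  # always "true" when in doubt
suspicious = expected_true_false_score(0.40, 0.70, 0.0)   # always "false" when in doubt
print(round(acquiescent, 2), round(suspicious, 2))  # 0.82 0.58
```

Under these assumptions the two students differ by 24 percentage points with no difference in knowledge; if half the statements were true, the model gives both the same expected score.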

Aptitude tests are also affected by response styles, though to a lesser
degree than proficiency tests (Cronbach, 1950). In one of Thurstone's spatial
tests, the student is to mark all the figures in a row of six which are just
like a given figure save for being rotated. Some students consistently mark
many figures in the row, while some mark only one figure even when several
are correct. This caution, or lack of thoroughness, lowers scores. The
Seashore pitch test requires subjects to judge whether the second of two
tones is higher or lower than the first. Some students are strongly biased
toward one of the two answers; in one class of ten students, the most biased
student marked 75 items H and only 25 L. After the class was given a short
talk on the nature of bias, their scores improved. This particular student
gained 1[...] points (on a 50-point scale from pure chance to perfect).

For measuring ability, multiple-choice or best-answer tests are distinctly
preferable to tests having fixed response categories such as true vs. false
or agree vs. disagree. The best-answer test is not only virtually immune to
response biases other than tendency to gamble but is especially well adapted
to testing of comprehension.


11. A mental test uses items like the following:

    [...]              SAME-OPPOSITE
    Obscure-lucid      SAME-OPPOSITE
    Occult-mystical    SAME-OPPOSITE

What response styles is such a test affected by when given with a time limit?
Devise a test for the same ability which would be less influenced by response
style.
Recognition vs. Recall. A major issue in educational testing is whether
recognition tests and recall tests on the same content measure the same
ability. Multiple-choice and other recognition items are necessarily given
great emphasis in standardized testing because they are easy to score. This
has been a source of concern to teachers who feel that only tests requiring
free responses can measure adequately what they teach. Especially where
the purpose of teaching is to produce ability to recall or invent new
solutions, teachers tend to prefer free-response tests. The English teacher
prefers to judge a student from a sample of his free writing, rather than on
tests where he merely identifies errors. The mathematics teacher feels that
his students should be required to solve problems, rather than merely to
select alternatives in what one writer calls "place-your-bet" questions.

To evaluate this argument requires an experiment to determine whether
the recognition and recall tests rank subjects in the same way. The result
of this experiment depends on the ability measured. In arithmetic, the two
rankings correspond closely; at the other extreme, penmanship performance
has negligible correlation with ability to recognize good writing. In college
mathematics, multiple-choice questions had reliability coefficients and
correlations with grades in later mathematics essentially the same as those
for free-answer questions (College Entrance Examination Board, 1946).

One might think that ability to generalize from data could be tested only
by requiring that the student form his own generalizations. But a test
requiring undergraduates to identify the best and poorest generalizations
from a set of data correlated .85 with ability to draw generalizations
directly from the data. Planning an experiment is a creative function, yet a
recognition test calling for choice among alternative plans correlated .79
with a free-response test of ability to make plans (R. W. Tyler, [...]).

It seems likely that free-response tests can be superior to recognition tests
where one is required to measure very accurately. Among graduate students
who have overlearned the verbally stated principles of scientific method,
probably all would do well on any reasonable objective test of experimental
design. But even among such students there is marked variation in
inventiveness in attacking new problems, and a long, carefully scored
free-response test may be the best measure. Similarly, an objective
recognition test sorted students accurately on French pronunciation (Tharp,
1935). But in an advanced group it is doubtful that fine discrimination
between those with authentic and those with false accents could be obtained
by anything but a performance test.

The most serious charge against recognition tests is that they have often
been confined to measurement of simple, even trivial knowledge of facts. It
is possible, as many examining bodies in universities have demonstrated, to
devise objective questions which call for deep comprehension and subtle
reasoning. Recognition tests are by no means limited to simple mental
processes. The difficulty is that ingenuity and effort are needed to prepare
a penetrating objective test, whereas a taxing (if not necessarily valid)
essay question can be scribbled off in minutes.


12. A course in psychological testing is intended to prepare students to perform the 
skills listed below. For which would a recognition test be acceptable? 
a. Selecting a test battery for a college counseling bureau. 
b. Administering and scoring the WAIS. 
c. Drawing proper conclusions from a validation study. 
d. Making proper interpretations of technical terms used in test manuals. 
13. What is the relative importance of free recall and recognition of correct re- 
sponses in 
a. learning to interpret children's problem behavior in terms of probable 
causes. 
b. learning to play bridge. 
14. Discuss the following comment by a newspaper columnist:

"I view with some misgivings the purely utilitarian course in 'Communications'
which has been substituted for the traditional freshman composition course at
S--. Students' needs in this course, we are told, are ascertained by the
administration of 'batteries of tests.' I venture to assert that nothing will
be learned from these tests which a skilled teacher would not find out from a
single theme and a half-hour interview; and that these would be better for the
student psychologically, as motivation for the course, than the 'batteries of
tests.'"


Taxonomy of Educational Outcomes. 
aptitudes by factor analysis, there h 
on proficiency variables, Tyler’s correl 


In contrast to the vast effort to map out 


ational studies barely illustrate the 
inct types of outcome are there in à 
inference a general ability, or specific 
tween outcomes the same at all ages? 
mes depend on the method of instruc- 


ing the nature of proficiency. 
The structure of proficiencies 

lem than "the structure of ment 

between proficiencies depends 


has been less appealing as a research prob- 
al abilities.” It is obvious that the correlation 
upon what one has studied, Factual knowl- 
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edge of college physics correlates with skill in mathematics, if only because 
most physics students also take mathematics. When he investigates whether 
an “aptitude” such as mechanical comprehension is correlated with mathe- 
chologist often has the illusion that he is dealing 
directly with the natural organization of the mind. That this is an illusion is 
shown on the one hand by the substantial correlation between the TMC and 
the biographical inventory, and on the other hand by theoretical research 
Such as Piaget’s. The aptitudes treated by the factor analyst are no less 
dependent on experience than are the proficiencies. How mathematical un- 
derstanding of science, after training, is related to understanding of the con- 
crete aspects of science is fully as challenging and urgent a problem as any 


question about predictive abilities. 

The only major advance in conceptualizi pi 
from a logical rather than an empirical investigation. À group of specialists 
In educational testing, most of them university examiners, has developed 
a "taxonomy" of educational objectives. This is a grand index of all the 
Variables which instructors and educational testers have suggested meas- 


uring for the purpose of evaluating instruction. The variables are classified 
logically; these groupings provide hypotheses that certain types of behavior 
ii PSychologically similar (for example, that they might be developed by 
Similar teaching methods). : 
As outlined in Table 53, the taxonomy has six major sections: Knowledge,
Comprehension, Application, Analysis, Synthesis, and Evaluation. The
abilities are listed in an approximate order of complexity; sections are also
subdivided to separate more and less complex processes. One must comprehend
something before he can apply it, generally speaking, and he must be
able to analyze elements before he can analyze organization. The taxonomy
gives a complete definition of each category and illustrates the category
with several educational objectives and several pages of test items.
The taxonomy has considerable value in improving communication between
testers and instructors. It offers a standard vocabulary for discussing
problems and provides a sort of checklist so that evaluators can recognize
whether they have listed all the objectives that ought to be measured. At
present, the taxonomy is limited to cognitive performance, i.e., to
knowledge, comprehension, and reasoning.

The illustrative test items for measuring higher mental processes are of
unusual interest. We can select only a few illustrations here, beginning with
a simple factual item falling in category 1.12 (knowledge of specific facts):
1 Table 53, Figure 67, and the test items in this section are taken with minor
modification from Bloom (1956). Copyright 1956 by Longmans, Green and Company
and reproduced by permission.

TABLE 53. Synopsis of the Taxonomy of Educational Objectives

1.00 Knowledge. Remembering something previously encountered.
  1.10 Knowledge of specifics. Recall of bits of concrete information.
    1.11 Knowledge of terminology.
    1.12 Knowledge of specific facts.
  1.20 Knowledge of ways and means of dealing with specifics. Includes methods
       of inquiry, chronological sequences, standards of judgment, patterns of
       organization within a field.
    1.21 Knowledge of conventions: accepted usage, correct style, etc.
    1.22 Knowledge of trends and sequences.
    1.23 Knowledge of classifications and categories.
    1.24 Knowledge of criteria.
    1.25 Knowledge of methodology for investigating particular problems.
  1.30 Knowledge of the universals and abstractions in a field. Includes
       organization of ideas by means of theories.
    1.31 Knowledge of principles and generalizations.
    1.32 Knowledge of theories and structures (as a connected body of
         principles).
2.00 Comprehension. Understanding of material being communicated, without
     necessarily relating it to other material.
  2.10 Translation from one set of symbols to another.
  2.20 Interpretation. Summarization or explanation of a communication.
  2.30 Extrapolation. Extension of trends beyond the given data.
3.00 Application. The use of abstractions in particular, concrete situations.
4.00 Analysis. Breaking a communication into its parts so that organization of
     ideas is clear.
  4.10 Analysis of elements. E.g., recognizing assumptions.
  4.20 Analysis of relationships.
  4.30 Analysis of organizational principles. E.g., recognizing techniques of
       propaganda.
5.00 Synthesis. Putting elements into a whole.
  5.10 Production of a unique communication.
  5.20 Production of a plan for operations.
  5.30 Derivation of a set of abstract relations.
6.00 Evaluation. Judging the value of material for a given purpose.
  6.10 Judgments in terms of internal evidence. E.g., logical consistency.
  6.20 Judgments in terms of external evidence. E.g., consistency with facts
       developed elsewhere.


Number of annual rings at the base of the trunk of an old tree is
    greater than
    less than
    the same as
the number of rings half-way up the trunk.


Knowledge of methodology (1.25) is also at the level of sheer recall, but the
content is more general. For example:


clues to the past. Some of these fossils are 


imals existing today. How does this affect the investigation 


geological history? (Choose one) 
a. Such fossils make the work much simpler since they can be easily traced.

b. These fossils are rare and therefore do not weaken the overall results very
much.

c. These fossils are extremely valuable since observation of their living counter- 
parts yields much information as to climates and physical conditions of the 
geologic past. 

d. The existence of living counterparts of fossils is immaterial since only the
fossil itself is important.

An item is not classified entirely by its content, since the processes the
student uses will depend on his experience. In a biology course which has been
studying about fossils, this is likely to be a recall item, but in a general
science course which has not touched on fossils, this item requires application
of scientific method to a new problem. In the taxonomy, the item would
have to be classified as a measure of application (3.00) for the general
science student.


[Structural formulas of several organic compounds, labeled by letter]

1. The compound which can neutralize bases and form salts.
2. The hydrocarbon which has the least tendency to "knock" among those listed
   above.
3. The compound which decolorizes bromine and potassium permanganate.

FIG. 67. Item testing comprehension of organic chemical formulas rather than
memory alone.


Comprehension items go beyond recall and ask the student to restate material.
The item in Figure 67 requires matching organic chemical compounds with
their properties. The compounds are representative of familiar types (e.g., B
is an alcohol), but the student is not expected to know the chemical formulas
given. This item is classified as translation (category 2.10) since the
formula must be recognized as equivalent to the verbal definition of an acid,
etc.
The following application item (category 3.00) calls for free response:

John prepared an aquarium as follows: He carefully cleaned a ten-gallon glass
tank with salt solution and put in a few inches of fine washed sand. He rooted
several stalks of weak elodea taken from a pool and then filled the aquarium
with water. After waiting a week he stocked the aquarium with ten one-inch
goldfish and three snails. The aquarium was then left in a corner of the room.
After a month the water had not become foul and the plants and animals were in
good condition. Without moving the aquarium he sealed a glass top on it.

What prediction, if any, can be made concerning the condition of the aquarium
after a period of several months? If you believe a definite prediction can be
made, make it and then give your reasons. If you are unable to make a
prediction for any reason, indicate why you are unable to make a prediction
(give your reasons).


The items in categories 4.00 to 6.00 tend to run considerably longer. The
student may read an argumentative passage and be asked to tell the function
of a certain sentence (4.20, analysis of relationships), or he may listen
to a recorded musical selection and answer questions on the arrangement
of the themes (4.30, analysis of organizational principles). Synthesis
ordinarily requires free response; e.g., one item asks the student to design a
chemical process to satisfy given specifications. As a final example, here is
part of an "evaluation" item (6.00). The student is to examine some
surprising statements about language by an Otto Jespersen and to
tell whether each of the following facts would lead him to trust Jespersen's
statement, or to distrust it, or would have no significance:
a. Mr. Jespersen was Professor of English at Copenhagen University.
b. The statement in question was taken from the very first article that Mr.
   Jespersen published.
c. Mr. Jespersen's books are frequently referred to in other works that you
   consult.


The taxonomy is ysis of constructs which ae 
describes the w. are organized. Many decades : gu 
however, the co ogy taught psychologists to T ae 
Picious of pure ; izations: Though the faculty psycho “3 
vers of the mind such as memo: 
no way to measure these pem i 

eneral adaptive ability. The d 
Bories of the taxonomy refer idden mental powers, but to observa 
ems. These abilities are impen d 
whether the categories describe "d 
€n different names correlate high s 
f tests within the same category cor 


an impressive anal 


© grouping is artificial, ed 
A recent study of the organization of intellectual skills appeared prior to
the taxonomy. Furst (1950) administered 27 tests covering several subject
fields to two groups of students at the start of the eleventh grade and again
late in the twelfth grade. Within each subject there were tests of factual
knowledge, judgment of relations, application of principles, and so on. One
group, in a private experimental school, was taught by a method stressing
integration of courses and development of higher mental processes. The other
group, from public high schools, was taught by more formal methods, the content
areas being sharply separated from each other. Furst examined whether tests of
the several mental processes had different correlations

at the start and end of the experimental period, whether the training program
affected the pattern of correlations, and how highly tests measuring similar
intellectual processes were correlated. The total study involved 1600
correlations, and only a superficial résumé of results can be given here.
Despite the differences in the two educational programs, the correlational
patterns for the two groups were nearly alike. The most important finding was
that tests dealing with the same subject area had higher intercorrelations
than tests dealing with the same mental process. In Table 54, based on


TABLE 54. Correlation Among Proficiency Tests Categorized
by Subject Matter and by Mental Processes

                              Average Correlation    Average Correlation
                              Within Group           with Tests Not
                              of Tests               in Group
Subject-matter groupings:
  English                     .48                    492
  Humanities                  .28                    .23
  Social studies              .45                    .35
  Physical sciences           .45                    .31
  Mathematics                 .55                    .34
  All categories              .44                    .31
Mental-process groupings:
  Critical thinking           .38                    .32
  Recall of information       .25
  Reading                     .52                    .39
  Language expression         .44                    .29
  Application of principles
  Interpretation of data
  All categories

Source: Furst, 1950.


the private school, the average correlation among a group of tests is compared
with the average correlation of those tests with all other tests. The former
value must be higher before one can argue confidently that tests within the
group measure some common ability. Subject-matter groupings clearly meet
this criterion; mathematics tests or science tests do have more in common
with tests of the same subject than they do with tests outside the
subject. This is to be expected. Knowledge of scientific facts, application of
scientific principles, and interpretation of scientific data are developed in
the same classes, and the same pupils who do well in one class tend to do
well on the other tests. The evidence on mental processes is essentially
negative: tests of application correlate no more highly with each other than
they do with tests of other processes. Likewise, Furst found
little evidence for a general ability to think critically or to interpret data.
This study leaves considerable need for further information. The correlation
between tests in the same field is not high; evidently the various tests
within the science field, for example, do measure somewhat distinct
abilities. But we have little knowledge as to why some students develop one
scientific ability more than another, and little understanding as to how
interpretation of data in science differs psychologically from interpretation
of social data.
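
The comparison summarized in Table 54 can be illustrated with a short
computation. The sketch below is only an illustration: the function name
`average_correlations`, the four-test correlation matrix, and the group labels
are hypothetical and are not Furst's data, and a simple unweighted mean of the
coefficients is assumed. For each grouping it returns the average correlation
among the tests in the group and the average correlation of those tests with
all tests outside the group.

```python
from itertools import combinations


def average_correlations(corr, groups):
    """Return {group name: (mean r within group, mean r with outside tests)}.

    corr   -- square, symmetric correlation matrix as a list of lists
    groups -- dict mapping a group name to the indices of its tests
              (each group needs at least two tests, and at least one
              test must lie outside it)
    """
    all_tests = range(len(corr))
    summary = {}
    for name, members in groups.items():
        inside = sorted(set(members))
        outside = [j for j in all_tests if j not in inside]
        # Every distinct pair of tests inside the group.
        within_r = [corr[i][j] for i, j in combinations(inside, 2)]
        # Every pairing of a group test with a test outside the group.
        outside_r = [corr[i][j] for i in inside for j in outside]
        summary[name] = (
            sum(within_r) / len(within_r),
            sum(outside_r) / len(outside_r),
        )
    return summary


# Hypothetical correlations among four tests (not taken from the study).
TOY_CORR = [
    [1.0, 0.6, 0.2, 0.3],
    [0.6, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.5],
    [0.3, 0.4, 0.5, 1.0],
]
TOY_GROUPS = {"subject A": [0, 1], "subject B": [2, 3]}
TOY_SUMMARY = average_correlations(TOY_CORR, TOY_GROUPS)
```

A grouping supports the claim of a common ability only when the first value
in its pair is clearly higher than the second.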


15. a. List several outcomes which might be considered in evaluating
       proficiency in algebra.
    b. Classify these outcomes according to the taxonomy.
    c. What empirical questions might be asked about the relation between the
       several proficiencies? What value might this information have in
       designing subsequent tests, in altering instruction, and in guiding
       students?
16. In question 15, substitute clinical psychology for algebra, and answer the
    same subquestions.
17. Would a logical taxonomy of psychomotor abilities have led to the same
    results as factor analysis?
18. The following quotations from want ads specify proficiencies that an
    employer might want to test by interview, written test, or other methods.
    Locate the proficiencies in the taxonomy as well as you can.
    a. "Wanted: Young man for advertising agency with 'a flair for writing.'"
    b. "Wanted: Senior marketing research analyst, thoroughly familiar with
       customer testing procedures."
19. What might a person studying psychological testing learn that would fall in
    each of the following categories of the taxonomy: 1.11, 1.21, 1.23, 1.24,
    1.31, 2.30, 4.10, 5.20, 6.20?


PUBLISHED TESTS OF EDUCATIONAL OUTCOMES 


Among the myriad tests which have been published for measuring pupil
accomplishment, some are concerned with single subjects like history or
science. Batteries of achievement tests have subtests measuring several
important areas of school attainment. The subtests are standardized
together, so that one can compare a pupil's relative standing in one subject
with his standing in areas most commonly measured 1? 
elementary-school include reading, spelling, language une 
i - On the whole, because of the way a A 
s have been more widely used at the peur 
School and college leve] than comprehensive batteries, but tests of p 
educationa] development are now widely used for selection of student 
and for guidance, 


t- 
1 " nplà 
mine how much science a student e P 

í Bock i; m 
knows, his general scientific competence is of more
interest than his mastery of a particular subject. A few recent test batteries
for high school and college attempt to measure general educational
development without regard to narrow subject-matter divisions. One important use
of such tests is to evaluate a man's readiness for college (Dressel and
Schmid, 1951). GED tests were first designed to assist men returning from
military service to reenter the educational system at the appropriate level,
regardless of the amount of formal credit they had received. Lindquist, a
designer of the original tests, indicates their philosophy (1944, p. 366):


The real ends of instruction are the lasting concepts, attitudes, skills, 
abilities, and habits of thought, and the improved judgment or sense 
of values acquired; the detailed materials of instruction—the specific 
factual content—are to a large extent only a means toward these ends. 
Since the detailed materials out of which a self-educated serviceman 
might have developed his . . . thinking might differ considerably . . .
from those used in formal classroom instruction, we felt that . . . we
must try to measure as directly as possible the ultimate outcomes of a
general education, and to minimize as much as possible the formal
pedagogical procedure that may be used to attain them in classroom
instruction.


GED batteries measure mathematical ability and English expression by
rather conventional items, but in science, social studies, and literature,
instead of testing what scientific facts or works of literature the student is
familiar with, the battery uses "tests of interpretation." He is asked to read
a passage resembling those in college science texts, and then is tested for
comprehension. Similarly, he is required to interpret social science materials
and passages from literature. The test draws on knowledge but requires few
specific facts. It should be noted that these tests are measures of general
education, i.e., of proficiencies that may apply to a wide range of future
experiences. Readiness for a specific course (e.g., college zoology) depends
both on general intellectual development and also on specific attainments from
prerequisite courses. The latter are measured by proficiency tests in
particular subjects.

For teaching purposes, measures of overall proficiency are not sufficient.
The teacher needs to know specific strengths and weaknesses of each pupil,
and diagnostic tests provide this information. Diagnostic tests focus on the
process by which the student responds, rather than the product. Diagnostic
procedures in reading will be described below. Since they stress analysis of
individual errors rather than comparison between students, diagnostic
tests are rarely standardized.

Many proficiency tests have measured knowledge and routine skills,
neglecting higher intellectual processes. Recent tests have paid more attention
to complex intellectual skills such as interpretation of experiments. Many
of these are "transfer" tests requiring application of skills and ideas to
situations not studied. Tests of ability to apply principles ask the pupil to
solve unfamiliar problems using the principles he has learned. If a pupil can
solve a problem he has not studied and defend his solution with a sound
principle, it is certain that he understands the principle. The TMC is in
effect a test of application of principles of mechanics. Another example is the
"aquarium" item above.


20. If admission to college depends in part on a test which stresses knowledge
    of historical facts, what instruction given high-school seniors would
    improve their chances of passing? What instruction would help them most if
    admission depended on a test of interpretation of social studies materials?
21. Discuss the argument: "The GED tests of interpretation are measures of
    intelligence and reading ability rather than of educational development in
    subject fields."


Important Educational Achievement Batteries 


2 it 
The following list o£ educational tests is by no means exhaustive, but i 
covers many prominent types includin 
most likely to encounter, Most of the 


ment Tests; Ernest W. Tiegs and Willis W. Clark: 


abulary, reading compre 
oning, arithmetic fundamentals, English naan 
abilities .79_.95 (Grade 6). Large agen” 
Tes are Significant, The proposed € 
Broups of items is not a dependable basis *" 
t - Norms are derived from a representative eo 
tional sample also used for norming the CTMM, thereby permitting : ° 
curate comparison of achievement for each pupil with that of pupils having 
see below), s 
Social and Related Sciences; Georgia Sachs Adam 
“st Bureau, 1946, 1953, Grades 4-8, 9-19. A test n 
} yielding six scores with reliability 85 .95 (6th grade). Four 
i sover history, geography, and other social studies; two sections cover 
Science content. Ite ary level test knowledge of the inp 
the typical program in general pum 
level deals specifically with Amen 


" b y es" 
> and with a mixture of factual and reasoning d" 


ari 
* Essential High Sch
Durost; World Book, 1951. Grades 10-13. A three-and-one-half-hour battery
measuring knowledge in mathematics, science, social studies, and English.
Each section covers specific course content rather than general
comprehension. For example, the mathematics test includes algebraic factoring,
recognizing graphs of conics, and recalling theorems about perpendicular chords.
Other problems cover everyday arithmetic reasoning, use of tables, etc. The
science section surveys factual and vocabulary knowledge and also measures
ability to reason from principles to conclusions. The single score for
each area is less analytic than the finer subdivision given in ITED or STEP.

* Evaluation and Adjustment Series; Walter N. Durost (ed.); World
Book, 1950. A series of tests for high-school use, each test with a different
author and for a different subject. (Examples: Anderson Chemistry Test,
Davis Test of Functional Competence in Mathematics, Engle Psychology
Test.) The several tests vary in quality, but each of the better tests
represents a comprehensive survey of outcomes regarded as important by
specialists in the field. The chemistry test covers principles such as valence
and photosynthesis, practical applications, interpretation of experiments,
chemical formulas, and quantitative problems. Norms are for students who
have had one year of chemistry. In general, tests in this series are well
designed for end-of-year evaluation of attainment in basic courses.

* Iowa Tests of Basic Skills; E. F. Lindquist and A. N. Hieronymus;
Houghton Mifflin, 1940, 1956. Grades 3-9. A battery requiring about five
hours, yielding scores on vocabulary, reading, arithmetic, language, and
work-study skills, each having a reliability .90 or over (Grade 6). Norms are
based on carefully selected national samples for each grade early in the
year and at end of year. Each test contains sections of graded difficulty.
Pupils in any grade take only those sections appropriate in difficulty
for the grade, but questions for adjacent grades overlap. All sections require
use of skills in meaningful contexts. The section on work-study skills measures
ability to read maps, graphs, and charts, and ability to use reference
materials, indices, etc.

Sod, Tests of Educational Development (ITED); E. F. jede ea 
atter, oa Research Associates, 1942, 1952. Grades A a = iis 
i "d a nine tests designed to measure general enn es ne 
Studied in hiding abilities, regardless of P eriam ts, interpreta- 
tion OF Cores include understanding of basic T DE T drin A 
antitana nd a in social studies, 2 pn of expression, et . 

Beo thinking, correctness and GLA co "d The “a> £s 
Predicts are carefully normed; reliabilities range trom - - e S 

college grades with validity near .60, this high validity being
attributable in part to the length of the battery. "Secure" versions of ITED,
of various lengths, are used in scholarship competitions, and in the American

tests such a difference in score represents very little difference in ability,
because the average rises only slightly from Grade 6 to Grade 9. In Figure 68
we see curves for two tests of the Stanford Achievement series in which
grade norms are compared to statistically derived "K scores." We cannot
consider the assumptions made in deriving the K scores, but the
implication of Figure 68 would hold for almost any scoring system. In
Language, grade increments above Grade 7 imply only very small improvements
in performance. In Social Studies, on the other hand, a three-grade
gain represents a large increase in knowledge.

FIG. 68. "True" increases in ability corresponding to equal changes in grade
scores for two subtests of the Stanford Achievement Test. (Vertical axis:
K Scores Representing Equal Units of Ability; horizontal axis: Grade
Equivalent Scores.)

"Ninth-grade levels" in different subjects are not equally hard for
sixth-graders to reach. The pupil who is "two years beyond his grade" in a
subject may sometimes be markedly superior; at other times this standing is
equaled by a large proportion of his class.


One further serious limitation of grade equivalents arises when conversions
for very high and low scores are derived statistically rather than by direct
observation. In a test battery intended for Grades 4 to 6 the author tests
pupils in those grades and then may determine by extrapolation what score
"ought to correspond" to the Grade 2 or Grade 8 average. This is sometimes
done just to save effort in standardization, and sometimes because it is
impossible for second-graders to take a sixth-grade test.

Grade norms imply that two pupils with a grade equivalent of 7.0 are
similar, even if one is in Grade 4 and one in Grade 9. This is just as unsound
as assuming that an MA of 12 means the same thing for a 9-year-old and for a
14-year-old. The advanced pupil and the retarded pupil with the same score
make different errors, and they are by no means ready for the same type of
instruction. Grade norms can lead teachers and parents only to unsound
conclusions and should be replaced by percentile scores based on single-grade
groups in a defined type of school, or by some similar system. We may
expect the grade norm to remain in use long after all test specialists agree on
its inappropriateness, just as it took a long time to displace the ratio IQ.
Teachers and school administrators are used to it and look for it; the
publisher of a new test feels that he must satisfy this demand; and the vicious
circle rolls on.

FIG. 69. Expectancy chart for California Achievement Test in Grade 7.
(Redesigned from the chart presented in the CAT manual, 1957.) Vertical axis:
CTMM Mental Age (months); horizontal axis: Anticipated Reading Comprehension
Score.

Expectancy Norms. The employer who uses a proficiency test to screen out
poor performers is concerned only with raw scores. The educator, however,
wants to know if the student is making as much progress as he should. He
therefore wants to evaluate performance relative to ability, and gain over
performance at the beginning of training. The proper procedure compares
a pupil to the expectancy for his ability. The technique is illustrated
by the expectancy charts developed for the California Achievement
Tests. The expectancy chart shows what proficiency score can be expected
of pupils with each score on a mental test. Figure 69 shows a simplified
version of the chart for reading comprehension in Grade 7. The tester enters
at the left with the pupil's MA from CTMM. The line indicates the normal
achievement for pupils with his general ability. For example, a pupil with
MA 180 (15 years) is expected to earn a reading score of about 54. Any lower
score indicates that he is performing below his ability.

One great advantage of the expectancy chart is that it enables the teacher
to evaluate the attainment of his group even if it is not typical in mental
ability. A teacher who finds that end-of-year performance is below average
usually dismisses the finding if he knows that the group was weak to
start with. The chart can show whether this class is performing as well as
did comparable weak pupils in the norm group.
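
The logic of the expectancy chart can be sketched in a few lines of code. This
is a hedged illustration, not the procedure of the CAT manual: it assumes the
chart line is a straight least-squares line, and the norm-group figures below
are invented, chosen only so that the line passes through the one value quoted
above (MA 180 months, reading score about 54). The names
`fit_expectancy_line`, `expected_score`, and `mean_discrepancy` are
hypothetical.

```python
def fit_expectancy_line(mental_ages, scores):
    """Least-squares line predicting proficiency score from mental age (months)."""
    n = len(mental_ages)
    mean_ma = sum(mental_ages) / n
    mean_score = sum(scores) / n
    sxy = sum((ma - mean_ma) * (s - mean_score)
              for ma, s in zip(mental_ages, scores))
    sxx = sum((ma - mean_ma) ** 2 for ma in mental_ages)
    slope = sxy / sxx
    intercept = mean_score - slope * mean_ma
    return slope, intercept


def expected_score(ma_months, line):
    """Score expected for a pupil of the given mental age."""
    slope, intercept = line
    return slope * ma_months + intercept


def mean_discrepancy(pupils, line):
    """Average of (actual score - expected score) over (MA, score) pairs."""
    diffs = [score - expected_score(ma, line) for ma, score in pupils]
    return sum(diffs) / len(diffs)


# Invented norm-group points (assumption; not values read from Figure 69).
NORM_MA = [120, 150, 180]
NORM_SCORE = [30, 42, 54]
LINE = fit_expectancy_line(NORM_MA, NORM_SCORE)

# A hypothetical weak class: low scores, but low mental ages as well.
WEAK_CLASS = [(120, 31), (130, 33), (140, 39)]
CLASS_GAP = mean_discrepancy(WEAK_CLASS, LINE)
```

A mean discrepancy near zero would suggest that the class is doing about as
well as comparable pupils in the norm group, even though its raw average is
low.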


26. What reading score is expected, according to Figure 69, for a seventh-grader
    whose CTMM MA is 14 years?
27. Might there be an advantage in preparing separate expectancy charts for
    pupils from lower-class homes?
28. Tests of the Evaluation and Adjustment Series are provided with expectancy
    charts in which expectancies are shown as a function of IQ. Is this more or
    less satisfactory than the use of mental age as in Figure 69?

Conversion Scales for Recording Progress. There are obvious advantages for
the school program in using the same tests from year to year, though it is
necessary to use more difficult tests as the pupil advances through school. It
is also advantageous to standardize tests for the several subject areas on the
same or comparable groups. Publishers of educational tests have invented
single conversion scales which can be used for all the tests they publish.

The usual method for articulating consecutive tests at different levels is
the equipercentile technique (see p. 98). Suppose a reading test has two
levels, one for Grades 7 to 9 and one for Grades 9 to 12. Both tests may be
given to a large sample of ninth-graders, and percentile scores may be
determined. The results will look like this:


                   Lower-Level    Upper-Level
                   Raw Score      Raw Score
80th percentile        62             49
Oth percentile         48             33
Oth percentile         32             18


Then a raw score of 49 on the upper-level test is regarded as equivalent to a
raw score of 62 at the lower level. This permits measurement of growth in a
pupil who is given the lower form in Grades 8 and 9 and the upper form in
Grade 10.
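
The conversion between levels can be sketched as a lookup with interpolation.
The example below is an assumption-laden illustration, not a publisher's
procedure: it uses only the three matched score pairs from the table above,
interpolates linearly between them (an actual conversion table would be built
from the full percentile distributions), and the function name
`upper_to_lower` is hypothetical.

```python
from bisect import bisect_right

# (upper-level raw score, lower-level raw score) pairs that fall at the same
# percentile rank among ninth-graders, as listed in the illustrative table.
MATCHED_SCORES = [(18, 32), (33, 48), (49, 62)]


def upper_to_lower(upper_score, matched=MATCHED_SCORES):
    """Express an upper-level raw score on the lower-level scale.

    Linear interpolation between matched points is an assumption made for
    this sketch; scores outside the tabled range are rejected, not
    extrapolated.
    """
    points = sorted(matched)
    uppers = [u for u, _ in points]
    if not uppers[0] <= upper_score <= uppers[-1]:
        raise ValueError("score lies outside the range of the matched points")
    # Index of the right-hand point of the segment containing the score.
    i = min(bisect_right(uppers, upper_score), len(points) - 1)
    (u0, l0), (u1, l1) = points[i - 1], points[i]
    return l0 + (upper_score - u0) * (l1 - l0) / (u1 - u0)
```

Under these assumptions an upper-level score of 49 converts to 62 on the
lower-level scale, as in the text, so a pupil's scores on the two forms can be
placed on one scale before growth is judged.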


" " se 
29. On the test discussed above, interpret the growth shown by a pupil whose
    eighth-grade score (lower level) is 32, ninth-grade (lower level) is 48,
    and tenth-grade (upper level) is 35.


Reading Tests 


Reading deserves special attention in this book for two reasons. First,
reading tests have been developed in greater number and variety than any
other type of achievement test and demonstrate numerous problems in test
construction. Second, they are used more widely in guidance and clinical
examinations than other achievement tests.

Definition of Abilities. At a glance, reading seems to be a clearly defined
skill which could readily be measured, but tests having the same name
measure quite different behaviors. Authors disagree on what reading tests
should include and on the most useful definition of rate, comprehension,
word knowledge, etc., for testing purposes. One author examined 24 reading
tests and found that between them they measured 48 differently labeled
skills (Traxler, 1941). This does not mean that reading involves 48 specific
abilities, however. One test claimed to measure several "entirely different"
reading skills, but correlations showed that these scores actually measured
the same function over and over under different names. In factor analysis of
25 tests of reading and study skills, the following common factors were
found: tendency to read carefully (an attitude or habit), inductive
reasoning, rate of reading, verbal ability, vocabulary, rate for disconnected
facts, and chart reading (W. E. Hall and F. P. Robinson, 1945). In view of such
variety of test content, the person who needs a reading test must be careful
to define what reading ability he wishes to measure.


Survey Tests. Survey tests are ordinarily intended to assess general level of
reading development. They are used to screen pupils for remedial teaching,
to predict success in courses, and to check whether poor reading explains a
low score on a mental test.
Reading ability includes both speed of reading and comprehension,
and a useful test must consider both these elements. Most testers have tried
to measure the two aspects of performance independently, but they have
been largely unsuccessful. This problem occurs in most testing, but rarely is
it so obvious as in reading: when an act has several integral aspects, one
cannot divide the act into fragments for testing purposes.
In theory, the way to separate speed and comprehension is to hold one
constant while the other is measured. Speed can be minimized by giving a
test without a time limit; the subject's understanding of what he reads then
should be a measure of comprehension alone. Rate is much harder to isolate
because every person has many reading rates, which he changes with his
purpose and with the material read (Blommers and Lindquist, 1944). The
device for controlling comprehension is to require the subject to answer
questions about what he has read, or to cross out absurdities as he reads.
Most prominent reading tests for the lower grades are those contained in
standard batteries (also capable of being administered separately).
For college guidance and screening to determine remedial needs, there are
several carefully developed tests including the Survey Section of the
Diagnostic Reading Tests (see below), the Davis Reading Test (Psychological
Corporation), the Cooperative Reading Test (Educational Testing Service),
and the Kelley-Greene Reading Comprehension Test (World Book).

The following limitations should be borne in mind in evaluating reading 
survey tests: 

• The scores on rate and comprehension are often interdependent, so that
the subject can raise one at the expense of the other. When only a single
"rate of comprehension" score is obtained, thoroughness may lower the
subject's rate score.

• Time-limit tests supposed to measure comprehension often are strongly
influenced by rate of reading. Such tests have little diagnostic value, al- 
though they may be good predictors of school success. 

• The reading test covers only a selected range of content, yet reading
ability varies somewhat with different materials. Some people can read his- 
tory well but not science; some do well on stories, poorly on textbooks. Dif- 
ferent content is appropriate for different testing purposes. 

• Many tests measure only a limited type of comprehension. The skilled
reader must be able not only to follow sentences but also to take the main 
idea from a long passage, put together ideas from separate sentences, follow 
a logical argument, and so on. Some reading tests measure only the simplest 
comprehension, whereas others demand deep and thorough interpretation. 

Diagnostic Methods. A diagnostic proficiency test at its best is an impressive 
tool. With or without such tests all teachers and school psychologists must at 
times determine why students are having difficulty. An ideal diagnostic 
reading test calls attention to every aspect of the reading process wherein 
the pupil might have stumbled. Checking off one at a time the many sorts
of possible error, the tester is left with a picture of the specific weaknesses 
that must be remedied before the pupil can make normal progress. 

Such a diagnostic procedure must be based on extensive research to
determine the common types of errors. Once the errors are listed, it is
necessary to devise test procedures to reveal which errors the pupil makes.
Systematic diagnostic methods have been worked out for arithmetic and
[...] Among the widely known methods is the Durrell Analysis of Reading
Difficulty (Durrell, 1940).

Durrell based his tests on a study of the reading errors made by 4000 school
children. The tests provide opportunity to observe the child at work in
oral and silent reading [...]. The first tests deal with oral reading. The
tester records the time required to read the standardized paragraphs, and
notes errors as they occur. Silent reading is then checked on a
set of paragraphs of difficulty equal to the oral series. Questions are used to
check recall, and the teacher observes such reading habits as lip movement. 
A flash-exposure device is used to show words briefly; this detects percep- 
tual habits and errors. Finally, there is a phonetic inventory for children who 
have difficulty in word perception. The analysis is not a mechanical device 
—it calls for keen observation by the tester. In the oral tests, the tester must 
record phrase reading, hesitation on words, mispronunciation, omission of 
words or syllables, neglect of punctuation, and enunciation. The virtues of
the test are that it presents materials of standardized difficulty and that the 
checklist of errors calls the tester's attention to all the significant facts. 
The type of information that comes from a careful diagnosis is illustrated
by Durrell's report on Anthony, age 9-8, in the fourth grade. His Binet MA
was 9-4, but his general reading achievement was at the low second-grade
level.


On the Durrell Analysis of Reading Difficulty, Anthony made a low
second-grade score on oral-reading tests, but seemed quite unable to keep his
attention on silent reading. He did poorly on quick perception of words, and
had no method of word analysis. He read a word at a time in a strained voice
and a monotone. He was markedly insecure in his reading and repeated words
continually. He was unaware of the errors in his reading, indicating a lack of
concern about meaning. When his errors were corrected in his oral reading,
his comprehension was excellent.

The silent reading was marked by a high rate at the expense of mastery. He
skipped all the hard words; as a result his recall was scanty and inaccurate,
although he did the best he could with it. Strictly speaking, he did not read
silently at all, since his reading was accompanied by constant whispering of
the words, [...] sounds being given for the difficult words. His eye movements
in silent reading were irregular and unrhythmic, with seven to ten per line
and many regressive movements.


Much simpler diagnostic tests are designed for group use.
These contain subtests presumed to measure various types of reading ability.
Performance is represented as a profile showing the relative strengths and
weaknesses of the pupil. The Diagnostic Reading Tests contain a survey section
and a diagnostic battery to be applied to pupils who do poorly on the survey.
The diagnostic battery designed for Grades 7 and above offers the following:

Vocabulary in special areas
  English grammar and literature
  Mathematics
  Science
  Social studies

Comprehension
  Silent reading of textbook material
  Comprehension of similar material read to the student

Rate of reading
  General
  Social studies
  Science

Word attack
  Oral. An individual test for observing speed and errors in attacking new material
  Silent. A group test of skills such as syllabication

TESTS OF SKILLS IN PERFORMANCE 


Development of skilled performance in a repetitive task is the goal of
instruction in typewriting, comptometer operation, shopwork, blueprint
reading, dressmaking, and industrial training. Measurement of maximum ability
to perform is based on the principle of the worksample. One rates a sample
of the work produced, or observes and judges the performance itself. Many
of the methods described can also be applied to the study of typical behavior
on the job. Methods discussed in Chapters 17 and 18 are especially designed
to assess typical performance.


Product Rating 


For product rating, we must compare specimens of the best work of each
person. To compare people, it is desirable to [...] material. [...]
dictation which the subject must take [...] of material but also the speed
of [...]. [...] (1945) standardized a test in woodworking by requiring each
boy to construct a wood block like a model. The block was designed to demand
use of saw, drill, and chisel. Scoring was done objectively by imposing a
plastic pattern on the block to check dimensions.


Objectivity in scoring is aided by a checklist or rating scale. This forces
judges to notice the same features of each sample and to use a comparable
numerical scale. Two product rating forms are illustrated in Figures 5 and 70.

[Figure: product rating form with a score scale of 1 to 3. Items rated:
1. Appearance (shriveled / plump and slightly moist); 2. Color (pale or
burned / well browned); 3. Moisture content (dry / juicy); 4. Tenderness
(tough / easily cut or pierced with fork).]


Observations 


Observations or measures of active performance are needed when the
product is not an adequate index of a skill. The civil service typing test is
such a measure, indicating both speed and quality of performance. Some
tests use regular factory or shop equipment, while others use special
apparatus. Use of regular equipment is illustrated by a test of ability of
packers in a cannery. Production on the job could not be used as a test,
because of factors varying from day to day and because the job is normally
affected by teamwork. For test purposes, one conveyor belt was set aside and
one worker at a time assigned to it. A count was made of the number of cans he
packed per hour (Stead et al., 1940, p. 86).

Special equipment is used to obtain a worksample where regular equipment
cannot be used, because of either cost or danger. It is essential for
every submarine crewman to learn to use the escape hatch of his ship in case
it should sink in shallow water. The only sure test of ability is to have him
try to use it, but it is obviously impossible to make the test at sea. To test
(and to train) crewmen on shore, a replica of the escape hatch was built in
a tank. Since this test reproduces all essential features of sea conditions,
[...]

[...] ability of aerial navigators was tested by showing them a motion picture,
[...] from a plane, giving a view of the ground and of the essential
instruments. Aided by a map, students were to make a plot just as they would in
[...] proved to have little reliability (Carter, [...]

[...] sharpening a drill point, as shown in a film, on two occasions one month
apart (Siegel, 1954). Even though the questions dealt with readily observable
[...] (e.g., Did the man wear goggles while grinding?), the raters' answers
at the second occasion agreed with their first answers only 82 percent
of the time (50 percent being chance expectancy).
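
The 82 percent figure is easier to judge once chance agreement is set aside. The following is a minimal sketch in Python, assuming yes/no answers recorded on two occasions. The function names and the sample answers are illustrative and do not come from Siegel's study, and the correction used (agreement beyond chance as a fraction of the most that could be gained beyond chance) is one common convention rather than a procedure the text itself describes.

```python
def percent_agreement(first, second):
    """Proportion of questions answered the same way on both occasions."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equally long, non-empty answer lists")
    same = sum(a == b for a, b in zip(first, second))
    return same / len(first)


def chance_corrected(observed, chance):
    """Agreement beyond chance, as a fraction of the maximum possible beyond chance."""
    return (observed - chance) / (1.0 - chance)


# Hypothetical answers by one rater to four yes/no questions, one month apart.
occasion_1 = ["yes", "no", "yes", "yes"]
occasion_2 = ["yes", "no", "no", "yes"]
print(percent_agreement(occasion_1, occasion_2))  # 3 of the 4 answers agree

# Figures quoted in the text: 82 percent observed, 50 percent expected by chance.
print(round(chance_corrected(0.82, 0.50), 2))
```

On this convention the raters recovered roughly two-thirds of the agreement that was available beyond chance, which is consistent with the text's judgment that the ratings were not very dependable.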

Evaluation of performance is improved by recording systematically what
the subject does. Mechanical recording devices are especially valuable
where a performance is rapid or where subtle details are important. An
example from industrial training is provided in Lindahl's study of disk-cutting
operation (1945). The difference between good and poor [...] was
found to lie in the speed with which they went through each phase of the
cycle of operation. The operation called for pressing a pedal to drive the
cutting wheel, and releasing it for a new cut. Lindahl devised the recording
device shown in Figure 71, which yielded records such as that in Figure 72.
These objective records showed which workers were the best [...] and,
more important, what errors each was making. The records also provided a
means of teaching the worker what errors he was making and helping him
recognize the "feel" of the pedal when he was doing the act correctly.

FIG. 71. Schematic diagram of Lindahl's recorder (Lindahl, 1945). [Labeled
part: writing arm.]

30. Outline a plan for obtaining product ratings and performance observations
[...] following situations. In each case [...]
a. Testing a boy's knowledge of how to wire batteries in series.
b. Testing the [...]

FIG. 72. Improvement in the foot-action pattern of a trainee. Records are
shown after 9, 45, 141, and 239 hours. The record at the top shows long pauses
between strokes, uneven speed during the cutting (downstroke), and jerky foot
action at the end of the stroke. All these faults were eliminated in the final
record (Lindahl, 1945).

[...]orners could outread sixth-graders down the [...] was never brought to
light. Parents and [...] of school graduates and, according to their
predispositions, were pleased or displeased with the results. There was no sound
basis for judging whether the school had taught as well as might reasonably
be expected.
The first systematic comparison of school attainment was made by an
educational crusader, J. M. Rice (1897). He was convinced that the [...]
perfection in certain accomplishments was leading to faulty emphasis in
education, and he prepared a spelling test to collect evidence for an [...]
on the subject. His test, given in 21 scattered cities, showed that the
scores of eighth-graders were about the same in all cities regardless of the
time devoted to spelling. Although children in some cities were [...]
spellers during early grades, presumably because of stress on that subject,
such differences vanished by the end of schooling. Rice hoped to convince
teachers that they could reduce the time spent on formal skills, leaving
more time for an enriched curriculum. Ironically, the testing movement
which he fathered tended instead to chain the schools to limited curricula
and to increase the emphasis on a few skills.

Educators were quickly impressed with the advantages of determining
whether schools were "up to standard," and tests of reading and [...]
were prepared and widely used. Tests in other subjects followed. In some
cities, at the height of the enthusiasm for standard tests, every pupil was
given a nationally distributed examination each June in nearly every subject
he studied. Despite the marked benefits conferred by tests, the testing craze
eventually produced serious dislocations in the school program.

The Navy program described at the start of the chapter demonstrates that
tests are a powerful instrument for administrative control of the [...].
The tests show which teachers are bringing their groups "up to standard," so
that administrators can take prompt remedial action. The fact that tests can
be used in this manner constitutes an obvious threat. Even if a teacher
knows that no one in his school system has ever been discharged or
reprimanded after his class made a poor showing, the desire to make a good
impression on his superiors will cause him to take the tests seriously. The
teacher relieves his anxiety [...] on his pupils to work harder. This increase
in [...] on both sides might increase the amount pupils learn, but it has
frequently raised tension in the classroom to an unhealthy level. It is one
thing for a teacher to demand thorough preparation and work of good quality;
it is quite another for the teacher to whip his charges [...] they will "make
the best record in the school."

Administratively imposed tests [...] the effort in the classroom; they
channel that effort [...] until [...] teaching can become entirely [...]
of preparing for the [...]

The New York State Education Department, better known as the
Regents, administers uniform examinations, also better known as the
Regents, semiannually to all high-school students pursuing key subjects.
To prepare their students for this ordeal, many teachers abandon the
regular textbook in favor of a special booklet containing a review of the
subject and a reprint of recent Regents' examinations. The general
practice is to begin the review about four to six weeks in advance of the big
test, although some teachers start Regents preparation as early as the
first day of the term.


Such concentration may be reasonable if the test measures what the pupils
ought to be learning, but it severely restricts education when the test
covers the wrong outcomes or covers only a few of the desired outcomes.
The writer recalls visiting a rural school which was alarmed because all the
boys, upon finishing the compulsory eighth grade, left school to work on the
farm. The principal believed they should stay in high school, but the boys
considered school a waste of time. A look at the "literature book" for
Grade 8 supplied one clue to the difficulty. Pupils were being held to
selections about a Hindu boy and his village, mountain climbing in Tibet, and
other topics of remote interest. When the teacher was asked why she did
not encourage the boys to develop their language skills on bulletins from the
agriculture extension service that the boys would consider valuable, her
answer was: "I know this book isn't good, and the boys don't like it, but I
have to teach it because it prepares pupils on topics covered in the standard
test given at the end of the year by the County Office."

A test (or set of tests) is said to have "curricular validity" if it represents
the objectives of the curriculum the pupils have studied. Instruction should
not be identical in all classrooms of a given grade. Even within the same
class, it may be proper for different pupils to work on different skills at a
given time. A standardized test necessarily fits one particular set of
objectives and one particular body of content. Uniform instructional aims may
be assumed in Navy training; every torpedoman must learn the same things no
matter what school trains him. In public schools there is much less
justification for uniform content. Everyone would agree that elementary-school
pupils should learn certain basic concepts about society and the community
(for example, interdependence of communities and nations). One school
might approach this by a survey of local industries. Another might
develop the same concept with a unit on Great Britain. Perhaps a school in
Texas would find pupils more responsive to a unit on South America. All
of these programs would aim toward the same goal, and yet their content is
so different that no one test fits all three approaches.

The same problem arises even in fundamental skills such as spelling and
arithmetic, where objectives are more definite. Some teachers develop
spelling incidentally to instruction in other subjects. Drill on long lists of words,
as Rice first pointed out, produces very small permanent gains. It is reason- 
able to suppose that pupils will learn spelling just as well if they master 
words they have occasion to use. But when the teacher knows that the test 
will be a random sample of words from a "standard word list," he cannot 
hope to make a good showing by concentrating on words that pupils misspell 
in writing about South America. One published spelling test, for example, 
uses words such as anxious, foreign, vitamins, biscuit, admission, etc. The
only way to insure a high score on a test like this is to have a daily spelling 
drill with a miscellaneous list of words. In arithmetic, all teachers cover the 
same content, but there are wide differences in opinion regarding the ap- 
propriate timing of a particular topic. Should fractions be introduced in the 
third grade? If the test will include such items, the teacher is likely to try
to squeeze it in even though it would be wiser to put extra time on [...]
division. Conversely, if the test omits fractions, a teacher hesitates to spend
time on them even when a class is interested in fractions and ready for such
work.
Tests have effects on the pupils also. The pupil learns that "what really
matters" in any course is what shows up on the tests. Anything the teacher
introduces which will not be tested is likely to be regarded as a side show. A
mathematics teacher may try to show the similarity between geometric
postulates and the premises hidden in advertising appeals or political [...]
[...] know that their tests will cover mathematics [...]
[...] the digression, but they will not study the material. By
[...] the student is keenly alert to the fact that tests cover [...]
the course, and is sure to focus his study on what he thinks he will
[...] on.

Because of increased recognition of these problems, there was a swing
away from wholesale [...] administratively imposed testing in the years
following 1930. [...] school testing programs became inadequate with respect to
both the amount of information collected and the use made of it. The national
concern over educational quality, brought to a peak by the launching of a
Russian satellite in 1957, revived public and professional concern with tests
as a means of quality control. President Eisenhower, speaking shortly after the
Russian success, suggested that a national [...] might be the best way of
raising educational standards. The education legislation adopted by Congress in
1958 made special provision for [...] testing [...] for the statewide
programs [...] using standard tests as a means of quality control [...] that
they will discourage teachers from introducing untraditional material or
trying new methods. The test is likely to focus attention on those outcomes 
easiest to test, to the neglect of attitudes, originality, and complex ideas. 
Choosing and using tests wisely can overcome these difficulties without sacri- 
ficing the benefits that standardized testing can offer. There are four ques- 
tions to answer in planning such a program: When should the tests be used? 
What should be tested? Which tests should be selected? How should the
results be used? Of these questions the last is paramount.

The proper function of a school test is to improve the educational program.
It may do so by helping plan learning experiences for a pupil, by indicating
ways to improve teaching, or by building attitudes in pupils and teachers
which will promote better teaching. Once this point of view is accepted, it
follows that tests are initial, not terminal, parts of the educative process.
There is little merit in testing after it is too late to profit from the results. For
this reason, more and more schools are using achievement survey tests at the
beginning of the school year. When the results of suitable tests are placed
in the hands of the teacher in September, they provide a sound basis for
planning the year's work. There is no argument against testing again in June
to measure improvement, but in fall testing the emphasis is on diagnosis and
curriculum planning rather than on marking and recrimination.

In guidance, tests are used for the pupil rather than on him. They show
him his weaknesses, and are a more effective argument for his taking certain
[...] or changing certain habits than is pressure from the teacher. In
guidance testing, it is important to minimize competition and concern over
the effect of tests on marks. In the most successful programs, the pupil takes
the tests because he wants to know the results.
It follows that the tests have to measure something of importance. Some
schools will seek to measure acquisition of subject matter. Others will be
more concerned with educational development defined less in terms of
specific knowledge and more in terms of skills such as interpretation of data.
In general it appears that the most useful standardized tests are those which
cover highly general objectives rather than those covering specific content.
A test of ability to reason from a scientific principle to a conclusion about a
[...] situation is a fair test for almost anyone. It is not necessary for the
student to have studied either the specific situation or the principle; if he
can think scientifically, he can draw the correct conclusion. The GED tests
calling for ability to interpret new reading selections likewise measure
proficiency regardless of what the person has studied.

Tests should not be the sole determiner of the pupil's mark. Equal attention
should be given to locally constructed tests of objectives not covered in
the standard instruments, and to evidence the teacher has collected from
the pupil's continued class performance.

It is important to interpret scores in the light of the [...] pupils
and of the school program. The fact that a school is "behind" [...]
norms is no cause for alarm. The reasons for the [...];
they may not justify any change in the school [...]. A
school which takes in many pupils from [...] may
properly decide to spend most of its effort during the [...]
developing English vocabulary, even if this delays instruction [...]. A
fifth-grade class which has been enthusiastically composing [...]
and poems need not be criticized if grammar has been neglected in the
process. Such evidence would suggest extra effort on grammatical [...]
some other time, but would suggest changing the fifth-grade emphasis [...]
teacher thought it possible to improve formal usage while at the same time
developing creative abilities.

[...] testing has been detrimental when it [...]
training rather than educating pupils. So long as tests are considered in the
light of the pupil's past development and as a guide to future [...],
they need have no harmful results. They will have to be improved [...]
these new demands adequately. Tests of limited validity may serve [...]
as impartial marking instruments. But when a test bears the responsibility
of describing what a pupil knows and can do, and what he needs to
attain, it will have to meet a high standard of validity. It is in this
direction that improvement is to be anticipated.


31. If a teacher knows that a test containing items such as the following will
be given at the end of the fifth grade, how will it influence her
instruction? (Items from Stanford Achievement Test; copyright 1952, World
Book Co., and used by permission.)

1. A chief food of Eskimos is
fish   vegetables   fruits   cereals

2. A man who works with wood is a
plasterer   carpenter   plumber   painter

3. Each star in the United States flag stands for a
state   city   president   battleship

4. A large ranch in a mountainous area is most likely to sell
wool   milk   vegetables   chickens

5. The great pioneer leader in Kentucky was
Boone   Clark   Marion   Carson

6. The invention of the steam engine made possible the invention of the
reaper   locomotive   sewing machine   Bessemer converter

7. A popular amusement in ancient Rome was
chariot racing   cricket   golf   [...]

32. What effect on the high-school social studies curriculum would be expected
to follow if tests of ability to interpret data (charts, graphs, reports,
etc.) were given annually to all pupils? Would this effect be beneficial
or harmful?

33. Most states require high-school students to study American history as a way
of developing their proficiency as citizens. Would it be beneficial or harmful
to give every pupil a test of historical information based on a random sample
of persons and dates in American history?

34. A writer says, "As a general rule, no achievement test printed or revised more
than five years ago, or any other test more than ten years old, should be
used." Do you agree?

35. Are there school subjects in which the content in a certain grade ought to be
uniform for all schools?

Suggested Readings

Katz, Martin R. Selecting an achievement test: principles and procedures.
Princeton: Educational Testing Service, 1958.
This thirty-page brochure (available without charge from the publisher)
covers the major considerations in selecting tests for school purposes. In
addition to a review of reliability and validity as they apply to achievement
tests, the author considers school characteristics which affect the choice of
tests and gives advice on how scores should be interpreted.

Noll, Victor H. Objectives as the basis of all good measurement. Introduction to
educational measurement. Boston: Houghton Mifflin, 1957. Pp. 90-107.
This chapter, from a representative textbook dealing with problems of testing
in schools, describes and illustrates the process of stating educational
objectives and using them to direct test construction and test selection.

Travers, Robert M. W. The trend toward the measurement of skills. Educational
measurement. New York: Macmillan, 1955. Pp. 94-115.
Travers explains the reason for growing interest in intellectual skills as distinct
from mastery of facts, and describes tests used to measure thinking skills and
study skills.

PART THREE 


TESTING OF TYPICAL PERFORMANCE 


14


Interest Inventories 


We now turn from the study of ability tests (tests of maximum
performance) to the assessment of typical behavior. We shall begin with interest
inventories, paying particular attention to selected inventories which
illustrate different techniques of measurement. With these concrete examples
before us, we shall discuss in Chapter 15 some general problems of
obtaining information on typical behavior.

Functionally, interest inventories are closely related to the aptitude tests
we have been considering in preceding chapters, since their main use is in
vocational and educational guidance. An interest "test" is a lengthy
questionnaire. It applies the "self-report" technique referred to in Chapter 2,
obtaining information by having the individual describe his own characteristics.
The questionnaire or inventory may be regarded as a written interview
which, since it uses numerous rather indirect questions, is in some ways
more satisfactory than the direct oral interview. A single direct question,
"Would you like to be a teacher?" does not give adequate information for
guidance because answers may be based on ignorance or superficial
understanding of the vocation. A girl may reject teaching for no better reason
than that she thinks correcting papers would be tedious, little realizing the
numerous other activities in a teacher's day. Likewise, some boys choose law
because it calls for public speaking, ignoring its long hours of isolated
research and thinking. To get around such difficulties, the blunt question is
replaced by the indirect, comprehensive, objectively scored inventory.

An important advantage of the standardized inventory over the interview
is the possibility of comparing responses to those of reference groups. A
student may indicate that he likes 25 computational activities out of 80 such
activities listed in a questionnaire. This, on its face, appears not
to indicate much liking for computational work. But since our culture views
computation more often as work than as fun, this raw score of 25 places the
student near the 80th percentile for high-school boys. Though he may not be
strongly attracted to computation, he evidently finds it much less distasteful
than most boys do. He is a much-better-than-average prospect for a vocation
which combines computational duties with duties in which he would have a
positive interest.
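
The contrast between a raw score and a standing in a reference group can be sketched in a few lines of Python. The reference scores below are invented for illustration and are not norms from any published inventory; the function name and the convention of counting tied scores as half below are likewise assumptions, chosen only to show how a raw score of 25 out of 80 can still fall near the 80th percentile.

```python
def percentile_rank(score, reference_scores):
    """Percent of the reference group scoring below `score`, counting ties as half."""
    if not reference_scores:
        raise ValueError("reference group is empty")
    below = sum(s < score for s in reference_scores)
    ties = sum(s == score for s in reference_scores)
    return 100.0 * (below + 0.5 * ties) / len(reference_scores)


# Hypothetical raw scores (activities liked, out of 80) for ten high-school boys.
reference_group = [4, 8, 11, 13, 16, 18, 20, 23, 31, 40]

raw_score = 25
print(raw_score / 80)                               # fraction of the items liked
print(percentile_rank(raw_score, reference_group))  # standing relative to the group
```

The first figure, taken alone, suggests little interest; the second shows that in a group where low scores are the rule the same student stands well above most of his peers.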


THREE APPROACHES TO INVENTORY CONSTRUCTION 
Empirical Keying: The Strong Blank 


[...] psychological assumptions and develops scoring [...]
of responses with [...]. The Kuder Preference Record, its chief
competitor, describes the individual in terms of psychological traits (e.g.,
mechanical interests). The Strong inventory is comparable to the aptitude
test designed for a [...] occupation, where trial-and-error selection of items
maximizes predictive [...] little part in the test construction. [...]
the test is developed.

[...] consists of questions on hundreds of activities both [...]
[...] require a "like-indifferent-dislike" [...], fishing, being an
aviator, [...] to select activities that adolescents would [...]
to imagine, rather than activities that become meaningful
only as a result of work experience.

Assignment of Item Weights. [...] which the percentage of men-in-general
[...] percentage of men-in-the-occupation giving the answer. Engineers dislike
"Actor" more commonly than other men; therefore, response D is assigned a
positive weight in the Engineer scale. The weighting is proportional to the

TABLE 55. Determination of Weights for Strong's Engineer Key

                                                             Differences in
                           Percentage of                     Percentage Between
First 10 Items             "Men-in-      Percentage          Engineers and       Scoring Weights
on Vocational              General"      of Engineers        Men-in-             for Engineering
Interest Blank             Tested        Tested              General             Interest

                            L   I   D      L   I   D           L    I    D         L    I    D
Actor (not movie)          21  32  47      9  31  60         -12   -1  +13        -1    0    1
Advertiser                 33  38  29     14  37  49         -19   -1  +20        -2    0    2
Architect                  37  40  23     58  32  10         +21   -8  -13         2   -1   -1
Army officer               22  29  49     31  33  36          +9   +4  -13         1    0   -1
Artist                     24  40  36     28  39  33          +4   -1   -3         0    0    0
Astronomer                 26  44  30     38  44  18         +12    0  -12         1    0   -1
Athletic director          26  41  33     15  51  34         -11  +10   +1        [...]
Auctioneer                  8  27  65      1  16  83          -7  -11  +18        -1   -1    2
Author of novel            32  38  30     22  44  34         -10   +6   +4        -1    1    0
Author of technical book   31  41  28     59  32   9         +28   -9  -19         3   -1   -2

SOURCE: Strong, 1943, p. 75.


difference, Liking to be the author of a technical book is especially common 
among engineers; since it is a significant indicator of engineering interests it 
I5 given a weight of +3. In contrast to engineers, who tend to dislike act- 
Mg, 40 percent of artists respond “Like” to “Actor.” The weights of “Actor” in 
° Artist scale are +2 for L, 0 for I, and —1 for D. 
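
The arithmetic behind Table 55 can be illustrated with a short sketch. It
computes the engineer-minus-general percentage differences for a few rows of
the table and converts them to unit weights. The function names, and the rule
of dividing the difference by 10 and rounding, are assumptions made for
illustration: the text says only that the weighting is proportional to the
difference, and this rule is used because it happens to reproduce the rows
shown, not because it is known to be Strong's actual conversion.

```python
# Percentages (Like, Indifferent, Dislike) copied from rows of Table 55.
MEN_IN_GENERAL = {
    "Actor": (21, 32, 47),
    "Architect": (37, 40, 23),
    "Astronomer": (26, 44, 30),
    "Author of novel": (32, 38, 30),
    "Author of technical book": (31, 41, 28),
}
ENGINEERS = {
    "Actor": (9, 31, 60),
    "Architect": (58, 32, 10),
    "Astronomer": (38, 44, 18),
    "Author of novel": (22, 44, 34),
    "Author of technical book": (59, 32, 9),
}


def differences(item):
    """Engineer percentage minus men-in-general percentage for L, I, D."""
    return tuple(e - g for e, g in zip(ENGINEERS[item], MEN_IN_GENERAL[item]))


def weights(item, points_per_unit=10):
    """Illustrative rule (an assumption): one weight unit per 10 points."""
    return tuple(round(d / points_per_unit) for d in differences(item))


if __name__ == "__main__":
    for item in MEN_IN_GENERAL:
        print(item, differences(item), weights(item))
```

Under this rule a response is weighted only when the two groups differ by
more than a few percentage points, which is why "Indifferent" to "Actor"
(a difference of −1) counts zero.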
Occupational scores are converted into letter grades ranging from A to C.
Seventy percent of successful men in the occupation fall into the A group
on that scale. The interests of a person who falls below B+ are quite
different from those of the bulk of the occupational group. Only 2 percent of
the men in the occupation fall as low as C.

Strong's key is based on no psychological theory about engineers; he relies
entirely on test data to define what engineers are like. Some of the weights,
such as +2 for liking to be an architect, fit our expectations. Other weights
may seem quite [illogical]. Liking to write a novel lowers the Engineer
score and disliking such work counts zero, but being indifferent counts +1.
A few weights are illogical because they come entirely out of the numerical
findings (some of which are chance effects) and are not influenced by the
author's [judgment].

The empirical key is a more or less heterogeneous mixture. In Table 55, the
ten responses weighted for the Engineer key encompass interest in
mathematical-scientific subjects (Architect, Astronomer, Author of technical
book), dislike for verbal activities (Actor, Advertiser, Auctioneer),
indifference to Athletic director, and liking [...]. The remainder of the
key, to give only a few examples, [places] weight on the following likes:
calculus, chemistry, [...] Geographic, repairing a clock, writing reports, and
improving the [...]. These items all reflect scientific-technical interests.
Small weights are given to numerous miscellaneous likes not obviously related
to [...]: taking long walks, symphony concerts, military drill, talkative
people, [and courteous] treatment from superiors. Since these scattered items
have [less] influence on the score than the many highly correlated technical
items, [they] can be neglected in psychological interpretation of the
[score].

Keys for as many as 47 male occupations (Printer, [...]) are available. There
is also a women's blank which can be scored for Stenographer, Dentist, and 24
other occupations. The items of interest inventories, like those of the
biographical inventories mentioned in [...], are so varied that they can be
used to predict almost anything. A new key can be made for any vocation or
specialized group. For example, Strong originally provided separate keys for
accountants, office workers, [bookkeepers], and certified public accountants.
A later study of nearly [...] practicing accountants, however, found that only
about 40 percent of certain CPA subgroups make a score of A on CPA whereas 70
percent of men in an occupation are expected to make A. Strong therefore
prepared a new key for “Senior CPA.” The original scale seems to apply well
to partners [in] accounting firms and was renamed the “CPA Partner” scale. The
“Accountant” scale seems to apply to junior accountants, and probably also to
men who move from accounting into business management. The “Partner” scale
stresses verbal interests; the new “Senior” scale involves mathematical
interests and has a negative relation to such verbal interests as Lawyer [and]
Advertiser. Senior CPA and Partner CPA scores correlate only .07 (Strong,
1949).


Strong keys are not confined to vocational interests. By scoring answers
which men give more frequently than women, for example, a “masculinity-
femininity key” was prepared. In principle, the test could also be [scored to]
give an indirect measure of scholastic aptitude, of neurotic tendency, [or]
of financial credit.

[...] was first produced, calculating weighted scores on many keys was
extremely laborious. Fortunately, [...] have now been developed [...].


1. [Estimate the weights for the] Chemist scale, of the item “Actor,” [if the]
responses of chemists are as follows: 16 percent L, 34 I, 50 D.

2. Estimate the weights for the Musician scale, of "Actor," if responses of musicians
are 34 percent L, 48 I, 18 D.

3. Suppose you wished to make a key for the Strong blank for women to measure 
interest in being a mother, i.e., to predict whether a girl will enjoy raising a 
family. Outline the steps you would follow to prepare the scale, with special 
attention to the persons you would use as a basis for the key. 

4. Each of the following assumptions is implied in the construction or in some uses
of the SVIB. For each one, state a contradictory hypothesis that might be
reasonable.

a. One is not likely to succeed in an occupation unless the work is interesting
to him.

b. One is not likely to succeed in an occupation unless his interests are similar
to those of most other men in the profession.

c. Interest in the school subjects required for preparation for a profession is not
an adequate basis for predicting satisfaction in the profession.

d. The interests leading to satisfaction in a vocation in 1930 will also be associ-
ated with satisfaction in 1970.

5. Research psychologists generally find considerable mathematical work
necessary, yet liking for mathematics is assigned a weight of zero in the
Strong scale for Psychologists. How can this seeming inconsistency be
explained?

6. “An A rating in psychologist with B+ in physician and dentist should
suggest a different preparation and career than an A rating in psychologist
with B+ rating in engineer, production manager, and carpenter” (Strong, 1943,
p. 54). What differences in advice are justified in these cases?

7. How might an interest test be used to distinguish, among prospective
teachers, those likely to be traditional subject-matter teachers from those
likely to emphasize the development of the pupil as a person? Outline a plan
for research to develop such a procedure.

8. Kuder's [Occupational] inventory (not to be confused with his Vocational
inventory) is scored empirically by weighting items in a manner [like
Strong's]. Kuder mentions the following principles used in developing his
scale. Comment on the reasonableness of each principle.

a. The vocabulary should be kept simple.

b. To keep obvious vocational significance of the item to a minimum, items
should not consist of occupational titles to be checked as liked or disliked.

c. It is generally more important to sample a large number of relevant areas
than to obtain large samples of only a few areas.

d. When the purpose of a test is to differentiate between groups, reliability
within groups (e.g., within the group of engineers) is relatively unimportant.

9. Would Strong improve his Engineer key by discarding weights which do not
seem logical even though the item in question shows a difference between
engineers and men in general?


[Interest] Clusters. Although Strong's original purpose was to make
predictions about suitability for specific occupations, his test is used
equally often to obtain a general description of the person being counseled.
Such a description must organize the responses in terms of psychologically
meaningful traits. Factor analysis of the vocational keys has produced a set
of descriptive traits for the SVIB.

The analysis indicates the following clusters of occupational interest for
men:


Group I, Creative-scientific: Artist, psychologist, architect, physician,
dentist.

Group II, Technical: Mathematician, physicist, engineer, chemist.

Group III: Production manager.

Group IV, Sub-professional technical: Farmer, carpenter, printer,
mathematics-science teacher, policeman, forest service.

Group V, Uplift: YMCA physical director, personnel manager, YMCA secretary,
social science teacher, school superintendent, minister.

Group VI: Musician.

Group VII: Certified public accountant.

Group VIII, Business detail: Accountant, office man, purchasing agent,
banker.

Group IX, Business contact: Sales manager, real-estate salesman, life
insurance salesman.

Group X, Verbal: Advertising man, lawyer, author-journalist.

Group XI: President of manufacturing corporation.


Special keys for those groups involving several occupations have been
prepared, so that the counselor can score the blank on these eleven factors
[and] thus arrive at a meaningful overall description. The counselor using
[...] scoring may find it efficient to score the blank for the occupational
groups [as] a first stage in counseling, and then to apply specific
occupational keys [only] for occupations which seem important after
discussion of the group-key [results]. [Two]-stage scoring is inefficient,
however, wh[...]. Ironically, [...] [damaging] criticism of the use of group
keys. The [...] recommends that couns[elors] give attention to [...] groups
[...] he scores A, [...].

Strong has designed [a] “map” for [...]. The chart represents [...] the
surface of a globe. The record shown in Figure 73 [...] from Group V over the
“North Pole” to Group X. [...] [psychology] major tested in his senior year.
His interests emphasize “verbal” and “uplift” activities, with lower scores
on accounting and selling.


10. If the student represented in Figure 73 has done good academic work and
has been satisfied in his courses in psychology, what possible vocational
aims are suggested by the test?

11. There are many groups in which this student has low scores. Which of these
lacks of interest would be significant in deciding against certain positions
in psychology?

12. The three dimensions (up-down, left-right, front-back) of the chart
represent the three chief interest factors in the SVIB scores. How might these
factors be named?

13. How would one use the SVIB in counseling a boy who is considering becoming
a librarian or an English teacher, since there are no keys for those
occupations?


Homogeneous Keying: The Kuder Preference Record 


The evolution of Kuder's inventory was almost exactly opposite to that of
Strong's. Kuder began with a factor analysis of single items in order to
identify clusters of interests, and then organized these items into
descriptive scales. The scales were used in educational and vocational
guidance even though predictions rested on inference rather than evidence of
predictive validity. With the passage of time, information on the predictive
validity of Kuder profiles has been collected. Today scores for specific
occupations can be constructed from the Kuder profile just as for the Strong,
although the instrument is still used most often as a trait description.

We shall discuss primarily Form C of the Kuder Preference Record. Form A,
also in current use, is a personality test (see p. 496). Form B is an early
version of the vocational inventory, now replaced by Form C. Form D is a
recently developed set of questions designed to yield specific occupational
scores like those of the SVIB and not intended for description. Thus Form C
best illustrates the development of descriptive keys.

Kuder identified ten clusters of occupational interests, a cluster being a
group of items which have substantial correlations with each other. Such a
group is said to be homogeneous, i.e., there is a common factor running
through the items. The ten scores constituting the Kuder profile are: Outdoor,
Mechanical, Computational, Scientific, Persuasive, Artistic, Literary,
Musical, Social Service, and Clerical.


Each item is in the “forced-choice” form. Three activities are listed, for
example:


a. Develop new varieties of flowers.
b. Conduct advertising campaign for florists.
c. Take telephone orders in a florist shop.

The subject is to select the one he likes most and the one he likes least,
leaving the third unmarked. A person who chooses “a” as most liked receives
credit under Scientific and Artistic; choice of “b” scores as Persuasive; and
choice of “c” is counted as Clerical. These scorings are not arbitrary; the
items are counted in that key whose other items they correlate with. Judgment
entered the test construction only when Kuder decided what items to use in
his original tryout.
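
The scoring of the florist item can be pictured as a simple tally. The sketch
below is illustrative only: the function name and data layout are
hypothetical, each keyed scale is assumed to receive one credit, and only the
"most liked" choice is scored because the text does not say how the "least
liked" choice enters the keys.

```python
from collections import Counter

# Scoring key for the one example item quoted in the text.
FLORIST_ITEM_KEY = {
    "a": ("Scientific", "Artistic"),  # develop new varieties of flowers
    "b": ("Persuasive",),             # conduct advertising campaign
    "c": ("Clerical",),               # take telephone orders
}


def score_most_liked(choices, keys):
    """Tally one credit per keyed scale for each 'most liked' choice.

    choices maps an item id to the option marked as most liked;
    keys maps an item id to that item's option-to-scales scoring key.
    """
    totals = Counter()
    for item_id, option in choices.items():
        for scale in keys[item_id][option]:
            totals[scale] += 1
    return totals


if __name__ == "__main__":
    keys = {"florist": FLORIST_ITEM_KEY}
    print(score_most_liked({"florist": "a"}, keys))
```

Because one option can be keyed on more than one scale, a single choice may
add credit to several scores at once, as option "a" does here.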

The occupational interpretation is usually made by identifying the two
highest scores in the profile and referring to a list of occupations for which
those scores are believed or known to be relevant. According to the test
manual, a “3-6” profile (i.e., one with highest scores in categories 3,
Scientific, and 6, Literary) suggests the occupations author, editor,
reporter, physician, surgeon, psychologist, and etymologist.

Kuder scores are most often interpreted on the basis of their “common-sense”
meanings. A person like Mary Thomas, whose profile (Figure 74) [shows] high
Clerical and Computational scores, presumably will enjoy positions demanding
such activities. The low-interest areas are also important, since the person
might dislike work demanding such activity. In Mary Thomas' case, the
interest test was highly informative. She was majoring in child development
in college at the time she took the test. Her [grades were] mediocre, and her
work with children was not especially successful. When questioned regarding
her choice of major, she explained that she had set her heart on work in an
orphanage. This desire had arisen in childhood when she read a book about a
woman who helped orphan children, and this had seemed to her a “wonderful”
thing to do as a lifework. The low [...] scores in Persuasive and Social
Service activities suggested a [withdrawn] personality, while the high
Mechanical, Computational, and [Clerical] scores suggested a liking for
routine, uncreative activities. When questioned about office work, she
enthusiastically described her previous [...] work as a file clerk; her duties
apparently consisted solely of alphabetizing folders, yet she had “just loved
it.” Moreover, she had done well in secretarial training courses. Evidently
both ability and interest fell in an area she had not considered as a
vocational goal.

FIG. 74. [...] Kuder Preference Record. (Adapted by permission of Science
Research Associates, Publisher.)
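
The two-highest-scores rule used for occupational interpretation of a Kuder
profile can be sketched as follows. The text fixes only that category 3 is
Scientific and category 6 is Literary; numbering the ten scales 0 through 9
in the order they are listed is an assumption consistent with those two
values. The function name and the example percentiles are hypothetical.

```python
# Scale order as listed in the text; index = assumed category number.
KUDER_SCALES = [
    "Outdoor", "Mechanical", "Computational", "Scientific", "Persuasive",
    "Artistic", "Literary", "Musical", "Social Service", "Clerical",
]


def profile_code(percentiles):
    """Return a code such as '3-6' naming the two highest scales."""
    ranked = sorted(KUDER_SCALES, key=lambda s: percentiles[s], reverse=True)
    top_two = sorted(KUDER_SCALES.index(s) for s in ranked[:2])
    return "-".join(str(n) for n in top_two)


if __name__ == "__main__":
    # Hypothetical profile, not taken from the text.
    example = {
        "Outdoor": 40, "Mechanical": 35, "Computational": 50,
        "Scientific": 92, "Persuasive": 20, "Artistic": 55,
        "Literary": 88, "Musical": 45, "Social Service": 60,
        "Clerical": 30,
    }
    print(profile_code(example))
```

A code of this kind records only the two peaks; as the Mary Thomas case
shows, the low points of the profile carry information that the code omits.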


14. What tentative conclusions can be drawn about a college man with the
following percentile scores: Outdoor, 60; Mechanical, 50; [Computational], 30;
Scientific, 70; Persuasive, 98; Artistic, 70; Literary, 90; Musical, 50;
Social Service, 40; Clerical, 15?

15. A boy majoring in business administration shows high interests in
Persuasive and Social Service. He is near a[...] [has] a high score (78th
percentile) i[n] [Scientific] [...].

16. A person's absolute [...] [interest] [...] guidance be based on relative
or absolute sco[res]?

17. What high and low poi[nts] [...] foremen, [...] male photographers.


[Occupational] Keys. When Kuder published the first form of his [inventory]
in 1940, he suggested interpretation solely on logical bases. This requi[red]
[...] [interests] relevant to various occupations, and studies [...] were
needed to val[idate] [...] persons (46 percent) provided data (Baas, 1950).
Separate tabulations are made for the 27 clinical psychologists, 96
counseling psychologists, 29 theoretical psychologists, and 29 industrial
psychologists. For all psychologists combined, the median fell at the 84th
percentile (of men in general) in Scientific and Literary, between 60 and 70
in Computational and Social Service, and at or below 30 in Clerical,
Mechanical, and Persuasive. The only noticeable differences between subgroups
were in Artistic (Theoretical and Clinical above 60, others below 40) and
Social Service (Clinical and Consulting above 70, others below 50).

The results found by Kuder generally support the logical expectations. The
median profile for accountants shows peaks in Computational and Clerical.
Authors, editors, and reporters have a peak in Literary, chemists in
Scientific, musicians in Musical. At the same time, there are enough
departures from expectation to demonstrate that logical presuppositions must
be tested (D. N. Wiener, 1951). The median for Engineers, for example, is 64
in Mechanical; 68, Computational; 73, Scientific. These depart from 50 in the
expected direction, but not very far; and many engineers are below average in
one or all of these scores. Camp counselors of the YMCA might be expected to
have distinctly high Social Service scores, but they average only at the 69th
percentile, being equally high in Persuasive, Musical, Artistic, and
Literary.


Regression equations may be used to combine interest scores into a composite
which distinguishes men in an occupation from men in general. The best simple
formula for identifying carpenter interests counts Mechanical positively and
gives equal negative weights to Scientific, Literary, and Clerical. A formula
of this type was highly effective in separating carpenters from men in
general and from men in other trades (Mugaas and Hester, 1952). Kuder profiles
can therefore be transformed into occupational scores like those for the
SVIB. In practice, such translation is uncommon, because counselors are more
interested in helping students toward a general self-understanding than in
selecting particular occupations for them.
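
A composite of the kind described for carpenters might look like the sketch
below. The text states only the direction of the weights (Mechanical positive;
Scientific, Literary, and Clerical equally negative); the unit weights, the
function name, and the two example profiles are assumptions for illustration
and are not the published coefficients.

```python
def carpenter_composite(profile, negative_weight=1.0):
    """Mechanical counted positively; three scales weighted equally negative.

    The size of negative_weight is an assumption; the text gives no values.
    """
    penalty = profile["Scientific"] + profile["Literary"] + profile["Clerical"]
    return profile["Mechanical"] - negative_weight * penalty


if __name__ == "__main__":
    # Hypothetical percentile profiles, not taken from any study.
    trade_like = {"Mechanical": 80, "Scientific": 40, "Literary": 30, "Clerical": 30}
    office_like = {"Mechanical": 40, "Scientific": 60, "Literary": 60, "Clerical": 70}
    print(carpenter_composite(trade_like), carpenter_composite(office_like))
```

Only the ordering of composite scores matters for separating groups, so in
practice a cutting score on the composite would be chosen from the observed
distributions of carpenters and of men in general.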


Logical Keying: The Lee-Thorpe Inventory


There is still a third approach to questionnaire [construction]. The
Occupational Interest Inventory by Lee and Thorpe is a set of questions
selected [and] organized on the basis of judgment rather than on statistical
grounds. This, which we may refer to as a “logical” approach, contrasts with
the Strong and Kuder procedures, which depend primarily on statistical
findings. The logical approach is similar to the technique of constructing
proficiency tests by defining a universe of situations and selecting items
randomly from that universe.


Lee and Thorpe took as their starting point the description of occupations
given in the Dictionary of Occupational Titles. [In] this [...] the USES
defines and classifies virtually all American [occupations]. Within each of
six areas, tasks were selected to represent [high, medium, and] low levels of
responsibility. Brief job descriptions are [paired, and the] subject indicates
which of the pair he prefers. Scores indicate [the] frequency of choices in
the Personal-Social, Natural, Mechanical, Business, Arts, and Sciences
categories.

This inventory differs from the SVIB in that the original selection and
classification of items is based entirely on job descriptions rather than on
empirical evidence that persons in the job like the activity. It differs from
the Kuder in that the first grouping of items came from a logical analysis
rather than statistical isolation of factors. The Mechanical score, for
instance, includes a great variety of tasks: labeling bottles, [...] lathe,
making drawings with ruler and compass, repairing shoes, [...] elevator,
testing the strength of steel structures, designing airplanes, [...]. Such a
heterogeneous category is difficult to interpret either descriptively or
predictively. Knowing that a person likes half of the activities in such a
mixed group tells us little about what jobs he will find satisfying. Although
the original classification of items was based completely on logical
criteria, a subsequent correlational analysis was made to improve
homogeneity. Items which had nothing in common with the rest of the category
were eliminated in revising the test.


Relations Between the Inventories 


Initially, the three inventories were designed by quite different techniques.
Strong [...] searches out those interests which go with [...] logical nor
[psychological] [...] occupational [category] [...] duties which fall within
the occupational field. If engineers like mountain [climbing], [...] that
item; Lee and Thorpe, however, would find such an item [...]. If engineers
have to check [...], [Lee and Thorpe would] presumably include that in the
Mechanical score even if engineers dislike the task. Kuder ignores the
occupational structure at the o[utset]; he searches for a set of traits which
[...] statistical evidence.

Despite the differences in [...], the inventories have converged on mu[ch]
[...]. [The] factor analysis of Strong keys [makes it] possible to translate
SVIB occupational scores into a trait description. Through use of regression
equations and profiles of occupational groups, Kuder scores can identify
occupational categories into which people fit. Lee and Thorpe have begun to
purify their scales to increase interpretability and if they desired could
collect occupational norms for the test. In principle, therefore, all tests
can fulfill all functions. Each has its own characteristics, but research
gives no definitive answer as to which approach is best.

The inventories measure approximately the same interests and the
corresponding keys have substantial overlap, as can be seen in Table 56. This
table gives correlations for selected scales of the Strong and Kuder
instruments. Most of the correlations are in the neighborhood of .50-.70 for
closely corresponding scales. The corresponding scores for any two
inventories, however, involve a substantial amount of independent content.


TABLE 56. Selected Correlations Between Strong and Kuder Scores

Kuder scales: Artistic, Scientific, Mechanical, Computational, Social
Service, Clerical, Persuasive, Literary
(column alignment of the entries below is lost in the source)

SVIB group keys
Creative-scientific     [...]  .34
Scientific-technical    .67  .58  [...]
Uplift                  *  *  .39  .30
Business detail         *  .50  *  .60  .36
Business contact        *  *  *  *  .29  .70
Verbal                  *  [...]  *  .48

Asterisks indicate substantial negative relationships; the remaining
correlations are between .30 and −.30.

SOURCE: [...]; see also Triggs, 1943.


VALIDITY OF INTEREST MEASURES


Stability of Interests

The first assumption in using interest tests for counseling is that they
measure a stable characteristic. Evidence of stability is not enough to
establish validity, but it is a necessary first consideration.

Strong, in his extensive follow-up studies, finds that interest scores are
indeed stable after age 17. When Stanford students were retested after an
[...]

[...] (Group I)
whereas the “applied research” group score lower in [...] occupations and
have more interests in common [with] [...] personnel. The production and sales
engineers [...] both being interested in sales and office tasks and not in
[...]. Dunnette developed special keys for separating the engineers [into]
[...] types, and found that he could correctly classify two-thirds of [...]
validation group. It is of interest to note that he tried [two] procedures:
weighting item responses in the usual Strong [manner, and combining] Strong's
occupational keys in a regression formula. The [two procedures worked]
equally well, but the weighting of scores was much simpler [when]
occupational scores had previously been calculated. (See also Estes and Horn,
1939; [...] and Tucker, 1959.)
[Interest] tests can discriminate men satisfied in a job from [those] who are
dissatisfied. Perry (1955) divided Navy yeomen (clerks) [into satisfied and]
dissatisfied groups, according to whether they said they [would choose] the
same service career if they could start over. On the Office Worker key of the
SVIB, the mean score of the satisfied yeomen was 48 while that of the
dissatisfied group was 21. (The s.d. being 33, the difference is highly
significant.) A guidance service which had given the Kuder inventory [to]
high-school seniors and adults asked them, a year or more later, what work
they were doing and how well they liked it. The investigators then
[classified each] person according to whether his tested interests were
“suitable” [for] the job he held. A similar judgment was made regarding his
measured general mental ability. As Figure 76 shows, interests do forecast
satisfaction, and [the] combination of interests and ability taken together
is an excellent predictor.
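
The size of the difference Perry reports can be expressed in standard
deviation units. The sketch below uses only the three figures given in the
text (means of 48 and 21, s.d. of 33); the function names are hypothetical,
and the normal-curve step assumes normally distributed scores with the same
s.d. in both groups, which the text does not establish. Statistical
significance also depends on the group sizes, which are not given here.

```python
from statistics import NormalDist


def standardized_difference(mean_a, mean_b, sd):
    """Difference between two group means in s.d. units."""
    return (mean_a - mean_b) / sd


def proportion_above(threshold, mean, sd):
    """Share of a normal distribution (mean, sd) lying above threshold."""
    return 1 - NormalDist(mu=mean, sigma=sd).cdf(threshold)


if __name__ == "__main__":
    d = standardized_difference(48, 21, 33)   # 27/33, about 0.82 s.d.
    # Under the normality assumption: share of dissatisfied yeomen who
    # score above the mean of the satisfied group.
    overlap = proportion_above(48, mean=21, sd=33)
    print(round(d, 2), round(overlap, 2))
```

A difference of somewhat less than one s.d. means the groups are clearly
separated on the average while individual scores still overlap considerably.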
Further evidence that interest differences predict future satisfaction is
found in Strong's studies (1943, pp. 114 ff.) of men who change [from one]
field to another after leaving school. His follow-up study supports all the
following statements:

Men who remain in an occupation for ten years or more average higher scores
for that occupation than for any other.

Men continuing in an occupation have higher scores in that interest than men
who try the occupation and change.

Men who change from one occupation to another change [to] one in which their
interest scores were about as high as for the first.

The correlation of interest scores with professed satisfaction is [low]
([...] .20) in Strong's study and in other studies he cites. The principal
[reason] appears to be that among college graduates even those with low
[interest] in their work usually report satisfaction, presumably because
[...] working conditions, and role in the community play a large part in job
satisfaction.

Strong's most impressive data (1955) are those showing that college interest
scores predict what occupation the man will actually be engaged in eighteen
years later. Among men later employed in an occupation, five times as many
had A+ ratings in that occupation in college, three times as many had A−
ratings, and one-fifth as many had C ratings as among men employed in other
fields. In the Stanford sample an A in Engineer indicates one chance in three
of becoming an engineer, one in three of entering a related occupation, and
one in three of entering an occupation having little resemblance to
engineering. This indicates good validity, since a man may have several A's
yet can enter only one field of work.

FIG. 76. Dependence of job satisfaction on suitability of interests (Lipsett
and Wilson, 1954). [Chart of responses (“Best possible job for me,” “Like it
very much,” “Like it fairly well,” “Indifferent to it,” “Dislike it”) for
Group A (25), both interests and abilities suitable; Group B (34), interests
suitable, abilities not; Group C (49), interests unsuitable.]
McArthur (1954; McArthur and Stevens, 1955) challenges too simple an
assumption that interest scores ought to predict what a person will become.
He argues that many forces other than interests determine what field a person
will enter. Particularly, the family of the well-to-do boy may dictate what
field he will enter, and provide the economic support to assure success. When
the presumably wealthier Harvard students who had come from private schools
were considered separately from those who were public-school graduates, a
marked difference appeared. The private-school group, in general, entered an
occupation corresponding to their claimed interests and not to their measured
interests. The public-school group (generally upper-middle-class boys)
entered fields corresponding to measured interests.

This evidence does not deny that the SVIB accurately measures interests of
upper-class boys. “It is not,” Darley says (Gee and Cowles, 1957, p. 26),
“that the Strong doesn't ‘work,’ it is that you don't need it when students'
occupational choices are so completely determined by the subculture from
which they come. All you need to do is to ask a boy in this particular private
prep school what he is going to be and you get the right answer since this is
[...] predetermined by his entire environment. The Strong may truly reflect
another pattern of motivation which his subculture does not allow him to use,
and this is the tragedy of that subculture.”

For those entering professions or skilled jobs, interest measures [are]
meaningful. In the occupational world, however, a large [number of jobs] are
essentially routine and offer little possibility of self-fulfillment. As
Darley and Hagenah (1955, pp. 8-9) point out,

Only at [professional, managerial, and skilled] levels do students tend to
say “that would be an interesting job.” We as adults also tend to feel that
the really “interesting” jobs are to be found only in the upper categories
and that many workers are doomed to tasks [requiring] little training,
repetitive and routine activities, and rather [uninteresting] or
unchallenging work assignments. . . . In the various job satisfaction and
morale studies, a crude division of responses appears [to] be related to the
hierarchy of occupations. Respondents at lower [occupational] levels stress
as sources of satisfaction economic factors, [...] a chance to get ahead, a
need for recognition as persons. [Respondents] at upper economic levels
define satisfaction in terms of “interest[ing] work.” . . . For the former
group, satisfaction derives from sources external to the work.


There probably are no special patterns of interest characteristic of
unskilled occupations. Strong developed keys only for responsible
[positions]; even his Group IV covers only skilled trades. When Clark (1950)
tried to develop interest keys for various jobs by Strong's method, he found
[no distinction] among men in various unskilled trades. He did, however,
[succeed] in differentiating skilled trades from each other. The few Kuder
[data available] on lower-level groups support this view. Average profiles
for such groups as filling-station attendants, department store help, and
painters are similar, being much flatter than the profiles for professional
and skilled groups.

This result makes sense when we think about job duties. The [mounter], whose
aptitudes we discussed earlier, spends hour upon hour [...] wires together;
wherein lies his vocational satisfaction? If he is to be satisfied, his
pleasure must come not from the work itself but from [...] and freedom from
responsibility. The very [...] environment, presenting new situations [...].
While some people can be content i[n] [...] rou[tine] [...] command active
interest.

only 20 


19. Darley and Haganah esti critici 


5 ver 
mate that the Strong vocational keys —— 
Percent of the male Working population. Does this constitute a seri 
of the Strong scale? 


20. Production engineers ayera 


Strong Engineer scale, Wh 
in guidance? 


i verage Cor es 
9e only B— and sales engineers a 


o 
ong se 
at does this imply regarding the use of Str 
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21. Assuming a normal distribution of Office Worker scores, and the same s.d. in 
both groups, sketch distributions for satisfied and dissatisfied yeomen in Perry's 
study. How well does the Strong distinguish these groups? 

22. Expressed interests of private-school graduates predict what they will be doing 
more accurately than their inventoried interests. Does this mean that the SVIB 
is an invalid indicator of their interests? 

23. How is the interpretation of Strong's follow-up studies influenced by the fact 
that interest scores were discussed with the subjects during college? 

24. In addition to the original Physician key, Strong and Tucker developed two
sets of keys for medical specialties such as surgery. One key is based on items
which differentiate surgeons from men in general, the other on items which
differentiate surgeons from a mixed group of physicians. Which of the three
keys could best be used

a. for guidance of a college freshman?

b. for advising medical-school seniors about their careers?

c. for assigning Army medical officers to duty and advanced training?

25. Among veterans who planned to enter engineering, a success and failure
group were distinguished by Barnette, according to whether they continued in
engineering training. How do you explain the fact that the Kuder Mechanical
score, though high in both groups, had no relation to continuance whereas
Computational had a very marked relation?

26. Do you agree with the following statement?

"Insofar as stated choice of occupation by groups of individuals (high school
girls) may be considered a true criterion of interest, the lack of relationship
between statement of occupational choice and interest scores . . . may be
considered evidence of the lack of validity of the interest inventories."


Prediction of Occupational Success


Only a few studies have examined whether the usual interest scores predict
job performance. The most adequate studies are Strong's investigations of
insurance agents (1948, pp. 486-500). The Strong scores predict either
ratings or records of business produced, with correlations of about .40. Men
with A scores in sales interest wrote, on the average, $169,000 per year of
new policies, whereas C men produced only $62,000. A few C men, however,
were unmistakably successful.


E. L. Kelly and D. W. Fiske (1951) tested students entering training for
clinical psychology with predictors of all types: ability measures,
personality questionnaires, performance tests of personality, and interview
ratings. Four years later they collected such criteria as grades, scores on
performance tests, and ratings by training supervisors. Particular interest
attaches to the ratings on Overall Clinical Competence and Research
Competence. In a study with hundreds of predictive scores and a dozen
criteria, no single coefficient is dependable, but several findings regarding
interest tests emerged. The Kuder proved to have rather little predictive
value: of the correlations computed, only 16 (13 percent) reached .20. For
the SVIB, which has more scales, a total of 677 coefficients were determined,
of which 21 percent reached .20. Except for the Miller Analogies Test, a
measure of verbal ability, no test yielded better predictions than the
Strong. The larger correlations for the overall clinical and research
criteria are summarized in Table 57. The correlations may well have been
lowered by the restricted range in the group and by unreliability of the
ratings. [...] interests and creative-scientific interests appear to be
associated with success in clinical psychology, while interest in business
activities is associated with low ratings. There are some differences between
those high in research and those high as clinicians, the former being
stronger in scientific interests and distinctly lower in business interests.
Strong's original Psychologist key, prepared in the 1920's, emphasized the
interests of research workers, and here correlates quite high with rated
research competence. The Kriedt (1949) keys for various types of
psychologists show that among interests characteristic of psychologists there
are numerous patterns and each pattern is relevant to a different type of
success.
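
The effect of restricted range and unreliable ratings on a validity
coefficient can be illustrated with two standard psychometric corrections.
What follows is a minimal sketch, not a computation reported by Kelly and
Fiske. It applies the usual correction for direct restriction of range on the
predictor (often called Thorndike's Case 2) and the correction for
attenuation due to criterion unreliability. The observed correlation of .20,
the ratio of standard deviations of 1.5, and the rating reliability of .60
are illustrative assumptions only.

```python
import math


def correct_for_range_restriction(r_restricted, sd_ratio):
    """Estimate the correlation in the unrestricted group.

    sd_ratio is the predictor's standard deviation in the unrestricted
    group divided by its standard deviation in the selected group.
    """
    k = sd_ratio
    r = r_restricted
    return (r * k) / math.sqrt(1 - r**2 + (r**2) * (k**2))


def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Estimate the correlation with a perfectly reliable criterion."""
    return r_observed / math.sqrt(criterion_reliability)


# Hypothetical values, chosen only for illustration.
r_obs = 0.20          # correlation observed among selected trainees
r_range = correct_for_range_restriction(r_obs, sd_ratio=1.5)
r_both = correct_for_criterion_unreliability(r_range, criterion_reliability=0.60)

print(f"observed: {r_obs:.2f}")
print(f"corrected for range restriction: {r_range:.2f}")
print(f"also corrected for rating unreliability: {r_both:.2f}")
```

Under these assumed values a coefficient of .20 rises substantially after
both corrections. Corrected coefficients describe a hypothetical
unrestricted group and an error-free criterion, so they should be reported
alongside the observed values and never in place of them.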


TABLE 57. Correlations of SVIB Scales with Ratings of Trainees in
Clinical Psychology

                                            Correlations with Ratings on
                                          Overall Clinical    Research
SVIB Scale                                Competence          Competence

Group I: Artist, Architect, Physician     .10 to .22          .22 to .33[?]
Group II: Mathematician, Physicist,
  Chemist                                 -.04 to +.09        .27 to .36
Group III: Production Manager             -.25                -.10
Group V: Personnel Director, Social
  Science Teacher                         -.03 to +.06        -.15 to -.01[?]
Group VIII: Office Man, Purchasing
  Agent, Banker                           -.24 to -.08        -.30 to -.25
Group IX: Sales Manager, Life
  Insurance Salesman                      .02 to .07          -.24 to -.21
Group X: Advertising Man, Lawyer,
  Author-Journalist                       .24 to .35          .45[?] to .27
Psychologist keys:
  Original Strong Psychologist            .18                 .43
  Kriedt Psychologist                     .20                 .38
  Kriedt Clinical                         .26                 .01
  Kriedt Experimental                     -.08                .16
  Kriedt Guidance                         .00                 -.22
  Kriedt Industrial                       [?]                 -.22

E. L. Kelly and D. W. Fiske, 1951, pp. 150-155.


Interest inventories have shown negligible value for predicting success in
vocational training. The Air Force correlated scores on various interest
categories with grades in thirteen training schools. Almost all correlations
were below .20 (Brokaw, 1956).

Many students confuse interests with aptitudes and misinterpret the
interest test as a measure of what they can do best. Interests obviously tell
nothing about abilities; in general, the correlations between interests and
corresponding abilities (e.g., between Kuder Clerical and DAT Clerical)
are close to zero. A high interest score should be interpreted as indicating
that if a person survives training and enters the occupation, he is likely to
enjoy his work. Though interests imply motivation, their influence on
success is rather small. The Frederiksen-Melville study (p. 845) explains this
in part. They found that grades of "compulsive" students depend only on
abilities; such students make an effort whether interested or not, and their
interests have no predictive value. Among noncompulsive students, however,
interests predict achievement with validity .36-.55. While it is dangerous
to generalize from this one study, it seems reasonable to conclude that a
person with interests and abilities suitable for an occupation can and will do
well in it, a person with suitable abilities but unsuitable interests can do
well but may not, and a person with suitable interests and low aptitude will
do badly (cf. Fig. 76).

Efficient prediction of success cannot generally be expected for interest
scores based on differences between men-in-the-occupation and men-in-general.
It is necessary to establish differences between good-men-in-the-occupation
and poor-men-in-the-occupation. The characteristics differentiating good from
poor veterinarians might have no resemblance to the pattern distinguishing
successful veterinarians from the average man. Only a few studies have
developed keys distinguishing good from poor men. One study of salesmen and
servicemen for office equipment has demonstrated that this method of treating
interest inventories may have substantial predictive power (Ryan and Johnson,
1955).
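
The contrast between the two kinds of key can be sketched in a few lines of
code. This is an illustration under stated assumptions, not Strong's scoring
procedure: the item names, the endorsement proportions, the unit weights, and
the .15 threshold are all hypothetical. One key is built from the difference
between men in the occupation and men in general, the other from the
difference between good and poor men within the occupation.

```python
def unit_weight_key(p_target, p_reference, threshold=0.15):
    """Weight each item +1, -1, or 0 by the difference in endorsement rates.

    p_target and p_reference map item names to the proportion of each group
    that marks the item "like".
    """
    key = {}
    for item in p_target:
        diff = p_target[item] - p_reference[item]
        if diff >= threshold:
            key[item] = 1
        elif diff <= -threshold:
            key[item] = -1
        else:
            key[item] = 0
    return key


def score(liked_items, key):
    """Sum the weights of the items a person marks 'like'."""
    return sum(key[item] for item in liked_items)


# Hypothetical proportions endorsing each item (assumed for illustration).
in_occupation = {"outdoor work": 0.80, "selling": 0.30,
                 "helping people": 0.25, "laboratory work": 0.60}
men_in_general = {"outdoor work": 0.50, "selling": 0.30,
                  "helping people": 0.45, "laboratory work": 0.55}
good_men = {"outdoor work": 0.82, "selling": 0.45,
            "helping people": 0.35, "laboratory work": 0.60}
poor_men = {"outdoor work": 0.78, "selling": 0.15,
            "helping people": 0.15, "laboratory work": 0.60}

occupation_key = unit_weight_key(in_occupation, men_in_general)
success_key = unit_weight_key(good_men, poor_men)

applicant = ["selling", "helping people"]
print("occupation key:", occupation_key)
print("success key:   ", success_key)
print("applicant on occupation key:", score(applicant, occupation_key))
print("applicant on success key:   ", score(applicant, success_key))
```

With these assumed figures the same applicant scores below zero on the
occupational key and above zero on the success key, which is the situation
the paragraph above warns about.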


27. Assume that certain interest scores of veterinarians are distributed in the
manner described below. What advice would a discriminant key of the Strong
type lead to? What advice would be given if expectation of success within the
field were considered?

a. In Outdoor interests, veterinarians are higher than the average man, and
their success is positively correlated with the interest score.

b. In Persuasive interests, veterinarians have the same average as men in
general. Persuasive interests are positively correlated with success in the
field.

c. In Social Service interests, veterinarians tend to be below the average for
all men, and the correlation between interests and success is positive.

28. Do you agree with this opinion?

"Various criteria have been suggested in connection with [...] counseling.
. . . Success is often employed as a criterion. It is more appropriate
in connection with aptitude tests than with interest tests. But it is doubtful
if it is as good as it seems. Fifty per cent of people must always be less
successful than the average. Counseling evaluated on such a basis must always
appear rather ineffective" (Strong, 1955, p. 11).


Prediction of Academic Criteria 


There is no overall interest pattern significant of superior academic
performance. Among several groups studied at Yale, overachievers were
consistently a bit higher in scientific, uplift, and verbal scores on the
Strong and lower on business scores, but the overlap of the overachievers and
underachievers was very great (R. M. Rust and F. J. Ryan, 1954). Other
studies such as that of Kelly and Fiske confirm this result. Correlations of
interests with grades in specific fields or courses are generally below .30,
which implies that interest tests add only a small amount to formulas for
predicting grades. Interest scores may predict persistence even if they do
not predict grades. In one study of dental students 92 percent of those with
A scores on the Strong Dentist key graduated, compared with 67 percent of
B's and 25 percent of C's (Strong, 1943, p. 524).
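
How little a predictor with validity near .30 adds to a prediction formula
can be seen from the standard expression for the multiple correlation of a
criterion with two predictors. The sketch below is illustrative only: the
ability-test validity of .50 is an assumed value, the interest validity of
.30 is the upper bound mentioned above, and the two predictors are assumed to
be uncorrelated, in line with the near-zero correlations between interests
and abilities noted earlier.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validity of each predictor against the criterion.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)


# Hypothetical validities, assumed for illustration.
ability_validity = 0.50
interest_validity = 0.30

combined = multiple_r(ability_validity, interest_validity, r_12=0.0)
gain = combined - ability_validity

print(f"ability test alone:    {ability_validity:.3f}")
print(f"ability plus interest: {combined:.3f}")
print(f"gain in validity:      {gain:.3f}")
```

Even in this favorable case, with the interest score at the top of its usual
range and independent of ability, the combined validity rises by less than
.10. Lower interest validities or overlapping predictors would add less.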

Segel (1934) found definite correspondence between interests and
differences in achievement between courses. The correlation of [...] Engineer
interests and mathematics-marks-minus-history-marks was .61. This is a
finding of great potential importance in classification and guidance, but
unfortunately other investigators have not attempted to confirm and [...] it.

Specially constructed keys have had some success in predicting grades.
Various "studiousness" keys have been made for the Strong blank by [...]
on items which distinguish good and poor achievers. Mosier (1937) [...]
that studiousness scores and grades correlated .47 for students in liberal
arts, .24 for engineering students [...]. Though such [...] must be
established [...] average marks, but no technique has [...] satisfactory
enough for practical use. The only approach that can [...] specific marks by
means of [...] grounds, as in the studies of Segel and Frederiksen and
Melville.

INTERESTS AND PERSONALITY 


[...] honest self-report. Interests give clues regarding adjustment and
personality [...]. Highly intellectual interests or concentration in some
field where one has developed unusual competence may be an attempt to
withdraw from fields where one cannot be sure of superiority.

Some persons with conflicts arising from self-criticism find satisfaction in
activities which others find monotonous. Mathematical and clerical work, for
example, appeals to some workers who need to be sure that what they do is
right. Having added a column of figures and checked themselves, they can
feel an assurance they could never have after writing a story or planning a
party, or in some other activity where "rightness" is less objective. There
are others who can be satisfied only when imposing their own individuality
upon their work. Such people frequently dislike routine or stereotyped
activities but respond eagerly to artistic tasks where originality is
essential.

This suggests going behind the interest test score in an attempt to infer
the type of personality consistent with the interests. Because it relies on the
insight of the interpreter, any such attempt is open to error. A good many
empirical studies (Darley and Hagenah, 1955, pp. 108-133) have found
modest relations of interests to personality tests, but clinicians and
counselors have not found these studies very illuminating because they reveal
little about the nature of the stresses within each personality. Clinicians
therefore fall back on cumulated experience with individual cases. Rarely
is such experience collected and systematized; the most complete research of
this character is Anne Roe's work on eminent scientists (1952, 1957). Her
findings, based on interviews and projective tests, deserve close study by
anyone concerned with vocational counseling.

The implications of personality interpretations are well illustrated by these
comments on medical students (E. L. Kelly in Gee and Cowles, 1957, pp.
185-196).


As a group, the medical students reveal remarkably little interest in
the welfare of human beings. For example, one of the sharpest distinctions
I can find between a group of physicians at Michigan and a group
of clinical psychologists whom we have been studying is [the physicians'
higher] Farmer score on the Strong. . . . The Farmer key . . . is
based on the modal interest pattern of highly successful graduates of
scientific agricultural schools. . . . Such persons are not scientific in the
sense that they want to discover new truths; their concern is rather the
application of science toward the goal of increasing production. . . .

Another characteristic of medical students is reflected by their relatively
high scores on the Aviator scale. . . . The one thing they [various
kinds of pilots] have in common is maleness and a lack of interest in
anything cultural.

Our data suggest that if you want to select the kind of lad who is going
to be interested in public health, general practice, and so on, you
should pick the person with a high Strong score on the Carpenter key.
This is a person who has a relatively low upward mobile ambition in our
society.


Kelly's data, from incomplete research in one medical school, [...]
that interest scores shed some light on the role the person is likely to
perform within his profession. Among other criteria, sociometric ratings were
obtained from the student's peers, indicating (1) his social relationships,
likelihood of becoming a hospital administrator, and personal acceptability as
a colleague, and (2) [...]

[...] group of Strong [...] Kelly's 112 cases, the [...] Math-Science
Teacher, Physicist, and Dentist; negative [...], and the Strong keys for
sales [...] success in medicine or any other [...] not complete when the
person [...]

[...] available on personality [...] a study at the University of
California [...] (Darley and Hagenah, 1955, pp. 128-129) [...] in social
poise, lacking [...] poorly to stress, sympathetic; [...]


It is important to [...] indicating "good" [...] is open to challenge.
Most contemporary [...] to assume that the ideal personality is [...]
interested in social contacts, and effective as a leader. Roe (1952), however,
points out that many distinguished, highly effective, and apparently
contented physical and biological scientists are not at all socially oriented.
They care little for making friendships or for earning the good opinions of
others. Eminent and effective psychologists (including laboratory
experimenters), on the other hand, typically are concerned with having good
relationships with others. Roe finds that both groups had had difficulties
with social relationships at some time in their preadult development, and
believes that each group chose a different method of adjusting successfully to
these difficulties. The physical scientists became absorbed in tasks not
involving other persons, while the psychologists made other persons their
professional concern. This leads Roe to question whether psychologists, merely
because their personalities have now crystallized about an active relationship
with others, build such a relationship into the definition of "good
adjustment" they apply to others. Quite possibly, says Roe (1953), the
psychologists are critical of effective and healthy patterns of adjustment
which do not coincide with their own. Conversely, if physical scientists were
to define the healthy personality after studying all the data available to
psychologists, their ideal might place little emphasis on warm friendships and
ability to lead, and a great deal of emphasis on responsibility, freedom from
suggestibility, and independence of group opinion. This argument is supported
by another study of ratings of graduate students at the University of
California. To the clinical psychologists, "soundness" of personality depends
strongly upon warmth in interpersonal relations, and eccentricity or deviation
from the norm is regarded with suspicion. When faculty members rate the same
students, however, "soundness" is judged almost entirely by the student's
effectiveness in getting his work done (Barron, 1954).


USE OF INTEREST TESTS IN COUNSELING 


Interest inventories are rarely used for selection or for administrative
decisions about classification, even where suitable scoring formulas permit
valid prediction. Historically, interest tests have always been a method for
helping the individual attain satisfaction for himself rather than a method
for satisfying institutions. As a result, the interest inventory is used
almost entirely in academic and vocational counseling.

One may conceive such counseling as intended to arrive at a decision,
selecting a definite goal and working out a training and career plan. Or one
may conceive the counseling as intended to promote the client's
understanding of himself. More and more, counselors are shifting to the second
point of view. As we have pointed out earlier, vocational development
necessarily involves new choices as new facts become available, as the
individual matures, or as his social circumstances and opportunities change.
The student and counselor in high school may set down a definite plan to take
certain subjects, to enroll in a certain college curriculum, to complete
training in a certain professional school, and to find an opportunity to
enter a certain type of practice. This plan has an almost negligible
probability of being carried out. Somewhere along the line instructors will
open new vistas to the student or arouse new interests. Somewhere along the
line concrete experience will show him that he does not enjoy some aspect of
the work or will reveal an unsuspected talent in another direction.
Counseling should realistically assume that any plan is a road with many
branches. In a good plan most of the branches are conceivably appropriate for
the client to follow, and he is able to reach any of the goals which at
present seem most appropriate.

If the goal in counseling is not to be definite planning, save with regard to
immediate decisions, the goal must be to equip the student to make wise
decisions as choice-points are reached. The aim in counseling should be to
give the student a more sophisticated view of the world of work, of the
choices open to him, and of his own range of potentialities for achievement
and satisfaction.


Interest inventories are peculiarly well adapted to vocational counseling.
The student expects his interests to be considered, and he is not threatened
by the questionnaire as he might be by personality or ability tests. The
interpretation, when given, carries considerable force, because the student
can see that he is looking at himself in a mirror, that he is only receiving
an analysis of what he himself has said. No psychological mysteries becloud
the interest test as they do tests involving more esoteric constructs of
personality and aptitude. From the counselor's point of view also, the
interest inventory is less fraught with emotional significance. The counselor
hesitates to tell the student his aptitude and personality test scores unless
there is ample evidence that he can accept and comprehend the findings.
Scores on an interest test can be discussed freely, however; while they may
require the student to examine discrepancies within his self-concept, they
rarely threaten his self-esteem.



For the counselor or high-school instructor who wishes to encourage thinking
about future plans, the interest inventory is a helpful device. It can be
given to entire classes or entire student bodies. Students are quite willing
to reveal their interests and are eager to have a report of scores. Although
there is some risk of misunderstanding, interpretation of profiles can be
carried out in group discussions rather than in individual counseling
(Layton, [...] 32 ff.). Such a process, leading each student to list
vocational possibilities suggested for him by the test, is an excellent
preliminary either to further group study of careers or to individual
counseling.


The interest inventory also assists counselors in dealing with many other
student problems. A promise to interpret interest scores is an excellent,
nonthreatening gambit to entice the student into the counselor's office. In
the course of the discussion of vocations he will necessarily talk about his
family, his social relations, and his academic difficulties, and so may touch
upon problems for which the counselor can provide help ranging from a
diagnostic reading test to psychotherapy. The conference opens a natural
opportunity for the student to express his desire for such help, a desire
which he might otherwise never have acknowledged even to himself.

In view of the aims described above, it is most unwise to concentrate the
interview upon an analysis of scores for specific occupations. This tactic
gives the student far too narrow a description of himself and leaves too many
things out of consideration. It is absolutely essential that the student
should go beneath occupational labels and stereotypes, that he should
understand the variety of roles different members of the same occupation
play, that he should understand the differences between demands of the
training program and demands of the occupation, and that he should recognize
the shifting nature of occupations. He must consider his abilities and
academic prospects, the pressures from his family, his motivations and
values, his financial resources, and the probability that his present
interests may shift.

Darley and Hagenah (1955, p. 195), speaking from this point of view,
criticize some common practices in vocational counseling. They take as
an example the student with peak interests in the social service group.


At some point in the counseling interview series, the counselor can
make this bald statement: "You have the same kind of interests as
successful personnel managers or Y.M.C.A. secretaries or school
superintendents." With minor modifications, this is probably the standard
approach to interpretation. It is also the least effective approach and the
one most likely to lead the student and counselor into ever deeper
morasses of interpretive difficulties.


They give eight reasons for condemning this approach, most of which we
have touched on already. Most specifically, such an approach immediately
leads the student to think in terms of occupational stereotypes, instead of
helping him to see what interests of his match activities common in the jobs
mentioned. Moreover, since he may attach a negative connotation to, let us
say, "School Superintendent" if he sees himself as a business executive in an
all-[...] world, the student may find it necessary to resist the test
interpretation.

Instead of a narrowly occupational interpretation, counselors should help
the student [...] activities in which he has expressed interest. [...]
groups of the Strong are similar [...]. A high score in literary interests,
for example, can be amplified by questioning which will clarify whether this
is an interest in reading, in writing, or in speaking; whether it is an
interest in face-to-face verbal activities or in isolative verbal activity;
and whether it is accompanied by any evidence of talent in expression. The
discussion will ultimately come around to specific vocations used as examples
of ways in which the expressed interests might be satisfied. Such
illustrative vocations can be selected by the counselor in the light of the
student's claimed interests, his probable ultimate level of education, and
his abilities. They may or may not correspond to the limited number of
occupations for which keys exist.

It is particularly necessary to reconcile differences between claimed
interests and measured interests in a way that is emotionally acceptable to
the student. To have told Mary Thomas, "You don't really want to work in child
development; you want to be a secretary," would have precipitated an
emotional conflict. No one can abandon a long-standing self-concept easily. An
authority who bluntly contradicts firm beliefs invites the counselee to reject
him as an authority. In Mary's case, it might have been better to inquire as
to the reasons for her choice of child development, to ask her to envision the
activities she might be engaged in ten years hence, and to compare those
with the activities rated high in the interest blank. The fact that the
inventory [...] ratings brings her face to face with her [...]. The
psychologist is no longer the "authority"; he is merely [...]

[...] most highly developed [...] ranks very near the top among
psychological tests of all types.1 The [...] complex weights are rather
inconvenient to score by hand and, indeed, may be no more valid than keys
with unit weights. [...] over its competitors will not find the cost and
[...] severe handicap. The scoring charge is currently [...]. [...] great
number of keys make interpretation both rich [...]. But its length and
complexity, together with its [...], make the Strong the preferred
instrument of most [...]

1 The [...] in painstaking test construction is [...] spent, and blanks from
over 23,000 people [...] obtained.

The Kuder blank has a much simpler format. Even ninth-graders can take
the test in groups, score their own tests, and plot their own profiles. There is
evident danger if this invites teachers to leave interpretation to the students,
but this is not a necessary fault of the inventory. The scores lend themselves
directly to interpretation in terms of patterns of activity, with vocational
interpretation secondary. For this reason scores of the Kuder type seem
slightly preferable to those of the Strong type, especially in the hands of
counselors with limited training. It is of interest that Darley, a leading
advocate of the Strong test, urges that it be interpreted in terms of interest
categories such as technical, social service, business detail, and
verbal-linguistic, rather than in terms of the occupational scores per se.
While the Strong can be so treated, the Kuder is designed for just this use.
It appears more suitable than the Strong for girls, and for students headed
for lower-level occupations. Both the Strong and the Kuder inventories are
long and tedious, which makes them somewhat unsuited for application to large
groups. Canfield (1953) has shown that Kuder profiles are not greatly altered
if only the odd pages are administered, thus cutting testing time in half. He
has prepared norm tables for this short form. (See also Clark and Gee, 1954.)

Inventories such as the Lee-Thorpe, developed on a logical or
content-sampling basis, are much harder to evaluate. The items are more
directly descriptive of vocations than are those of the Strong and Kuder, and
are therefore more likely to invite responses on the basis of stereotypes. A
more serious difficulty is that one cannot say whether the category scores
represent suitable constructs for describing individuals. The heterogeneous
mixture of activities called "mechanical" by Lee-Thorpe defines a less clear
interest pattern than the mechanical items of Kuder, whose intercorrelations
have been established empirically. The Lee-Thorpe inventory would be a more
useful instrument if considerable empirical work were done to revise the
groupings and to provide a background of facts with which to interpret
scores. In the absence of such facts, scores on Lee-Thorpe categories appear
to deserve little emphasis. The items, covering a wide range of occupations,
may be regarded by counselors as a checklist or pencil-paper interview.
Innumerable leads for interviewing will come out of consideration of the
separate items.

A final consideration in choosing between inventories is a statistical
comparison of reliabilities, intercorrelations, and relations with criteria.
Unfortunately, the available information is spotty at best, since only a few
isolated studies have administered two or more inventories to the same
sample. The Strong and the Kuder are about equally reliable, with the Strong
having a slight advantage. The "corresponding" keys sometimes agree very
closely but at other times seem to have different psychological meanings.
Differences of this kind indicate how important it is for the counselor to
become thoroughly familiar with the particular test he is using and with the
research indicating the meanings of scores.


29. The Thurstone Interest Schedule consists of a set of paired comparisons of
occupational titles such as Engineer-vs.-Accountant. Titles are assigned to
areas on a logical basis. The choices made in a given area are counted,
yielding scores in ten areas such as Physical Science, Business, Linguistic,
and Humanitarian. The scale requires about ten minutes. Discuss the advantages
and disadvantages of such an inventory for counseling.

30. The Thurstone profile is expressed in terms of percentage of choices in each
area. No norms are used, interpretation being based on the shape of the
raw-score profile. Is this advantageous or disadvantageous?

31. Compare the extent to which the Strong and the Kuder are influenced by
response style (p. 372).

32. An interest test is to be used in helping junior-college freshmen make
vocational plans. Among the training programs offered are those for engraver,
dietitian, and others not directly represented in the Strong and Kuder
keys. How could each test be extended to assist students in judging whether
they would like these fields? Which test seems to be more adaptable?

33. What errors are likely to occur when ninth-grade students score and
interpret their own Kuder profiles, in the course of several days of class
discussion? How can the teacher reduce such risks?


PROSPECTIVE DEVELOPMENTS 


Interest testing and test interpretation have changed markedly since Strong's
test was first published, and there is much reason to expect continued rapid
development. The initial investigations were blunt empirical comparisons of
occupational groups on items selected almost at random. There was no theory
as to the nature of interests or as to the types of interests most deserving
consideration; there was no theory about the structure of occupations and
careers; and the interpretations placed on test scores were entirely prag-
matic and unpsychological. Kuder's approach and Strong's factor analysis
led to the beginning of a theory of interests. Guilford and his associates, in a
tentative but comprehensive study (1954), found over twenty interest fac-
tors, most of which seem to reflect general personality styles rather than vo-
cational orientations. Guilford recognized several familiar factors such as
mechanical, scientific, and social-welfare, but he adds adventure vs. security,
aesthetic appreciation, cultural conformity or orderliness, need for diversion,
aggression, and many other interest dimensions. Longitudinal studies with
case histories have also begun to present a clearer picture of the significance
of interests. The original crude empiricism has declined.

The next development that may be forecast is the systematic interpreta-
tion of interests in terms of more fundamental personality constructs. The
work of Roe, Tyler, Kelly, Guilford, and the California investigators all repre-
sent preliminary steps in this direction. Whereas interests were once viewed
almost as a product of chance conditionings, today it is thought that interests
are an expression of deeply rooted needs and adjustment patterns. Interest
tests are perhaps superior to many other techniques for assessing personality
because of their diverse content and their acceptability to the subject. But
considerable research must be done to place such interpretations on a sound
footing.

Every current writer on interests and on vocational counseling stresses
the great need for a theory of interest development and for a theory of occu-
pational adjustment to replace the present piecemeal collections of facts.
Much current research is intended to produce at least the beginnings of such
a theory, and as it emerges this theory will no doubt have radical effects
upon test interpretation. It will also reopen questions as to the appropriate
age to begin the measurement of interests, and the appropriate items to be
used.


LISTING OF INTEREST INVENTORIES 


Among the interest inventories currently in use are the following: 


* Kuder Preference Record, Occupational, Form D; G. Frederic Kuder;
Science Research Associates, 1956. A collection of 100 forced-choice items
drawn from the Kuder Vocational and Personal inventories. Intended for in-
stitutions which wish to develop keys to place the most suitable persons in
particular jobs. For use in guidance, Kuder is developing and releasing keys
for various occupations. The 1957 manuals discuss keys for 22 occupations,
including those of electrical engineer, farmer, minister, etc. The information
available to date on discrimination between occupations is encouraging, but
until longitudinal studies and correlations with job satisfaction are avail-
able, it is not possible to judge whether this inventory can become as service-
able as the much longer Strong inventory.

* Kuder Preference Record, Vocational, Form C; G. Frederic Kuder;
Science Research Associates, 1939, 1951. For high-school students and adults.
A descriptive blank yielding ten scores showing the person's percentile
standing in various interest categories. (See pp. 412 f.)

* Guilford-Shneidman-Zimmerman Interest Survey; J. P. Guilford and
others; Sheridan Supply Company, 1948. For high-school students and
adults. An inventory based on factor analysis which identifies nine categories,
each of which has two subscores (e.g., aesthetic appreciation vs. expression).
Guilford's later work suggests revision and extension. The instrument is
primarily suitable for research on interest development rather than guidance
in its present stage of development.

* Minnesota Vocational Interest Inventory; Kenneth E. Clark; unpub-
lished. For high-school students and adults. Can be expected, when pub-
lished, to fill an important place in counseling and classification. Forced- 
choice triads are scored empirically to indicate how closely an individual's 
interests resemble those of men in various trades such as bakers, plasterers, 
retail sales clerks, and truck drivers. The inventory thus covers a portion of 
the occupational range for which the SVIB is inadequate. 

* Occupational Interest Inventory; Edwin B. Lee and Louis P. Thorpe;
California Test Bureau, 1943, 1956. For Grades 7 upward. Yields scores for
six fields (personal-social, arts, business, etc.) and another set of scores for
verbal, manipulative, and computational interests. (See pp. 415, 435.)

* Strong Vocational Interest Blank for Men; E. K. Strong, Jr.; Consulting
Psychologists Press, 1927, 1951, with supplementary research reports. For 
high-school students and adults. The outstanding example of an empirically 
scored interest inventory. Keys for 47 occupations, plus group factors. (See 
pp. 406 ff.) 

* Strong Vocational Interest Blank for Women; E. K. Strong, Jr.; Consult-
ing Psychologists Press, 1947, 1951. For high-school students and adults.
Scores for 27 occupations. This instrument has not shown satisfactory validi- 
ties and is rarely used. In counseling women who plan to enter occupations 
for which the men’s blank is scored, it is preferable to use the men’s blank. 

* Vocational Interest Analyses; Edward C. Roeber and Gerald G.
Prideaux; California Test Bureau, 1951. Grades 9 and up. To be used as a
second step, following rough mapping of interests by the Lee-Thorpe inven-
tory. This instrument has six sections of 120 items each, corresponding to the
sections of the Lee-Thorpe. The counselor administers those sections corre-
sponding to high scores on the first test to obtain a more detailed analysis of
interests within the area. As no evidence of validity is available, the inven-
tory should be regarded as a written interview rather than as a scored test.
showing how the use of SVIB information varies according to the client's
aptitudes, maturity of self-concept, and background influences. Of particular
interest is a history (Karl Brooks) showing development from age 14 to 25.
Kuder, G. Frederic. Research methods for development of an occupational key.
Research handbook for the Kuder Occupational Preference Record Occupa-
tional. Chicago: Science Research Associates, 1957. Pp. 27-38.
This is a brief account of the procedures used in developing and testing the
efficiency of a key to distinguish persons of one type from men in general.
Strong, Edward K., Jr. Interpretation of interest profiles. Vocational interests of
men and women. Stanford: Stanford University Press, 1943. Pp. 412-456.
Strong presents data on typical patterns among college students and suggests
how the relevant information can best be conveyed to the student seeking
guidance.


15 


General Problems in Personality 
Measurement 


PERSONALITY, attitude, and interest measures were introduced in Chap-
ter 2 as measures of "typical behavior," and thus distinguished from the abil-
ity tests, which measure maximum performance. In assessing typical behav-
ior, the investigator wants to know what the person normally does rather
than what he can do under exceptional motivation. In this chapter, we
examine the notion of "typical behavior" and compare various ways of gath-
ering and interpreting information about it. Before proceeding to this discus-
sion, it would be wise to reread the introduction to procedures used to inves-
tigate typical behavior given on pages 31 to 34.


TYPES OF DATA 


Observations in Representative Situations


ly in situation 
Which we are int 


ings with employees, 


The first requirement is a sufficient number of suitable observations. No
one act can be taken as typical, since it is influenced by mood, immediately
preceding experience, details of the surroundings, and other factors. There
are cycles and trends in behavior. If a subject appears quarrelsome on sev-
eral occasions, quarrelsomeness seems typical for him. Perhaps, however, he
is in a continuing state of irritability due to some worry, and some ...
took too small a sample and as a result had observed a mere temporary devi-
ation. Yet the deviation was real, and the behavior reported was typical for
the subject during that time.

Careful attention must be paid to definition in attempting to observe typi-
cal behaviors. One can be sure what a report of typical behavior means only
when the observer specifies the range of time represented in the data, the
range of situations, and the range of motivations. When these are not speci-
fied formally, they are often implied in the description of the observation
method.

The second procedural requirement is that the act of observing must not
alter the behavior observed. Just as the presence of a traffic cop at an inter-
section raises drivers from their habitual level to their best ability, so the
presence of the observer may cause the subject to try harder. This seems to
occur even when no reward or punishment will result. Roethlisberger and
Dickson attempted to compare work output under various conditions at a
Western Electric manufacturing plant. Relay assemblers were placed in a
small experimental room where they could be observed and their output re-
corded in great detail. Various experimental rest pauses and privileges were
introduced; as each change was introduced, no matter what it was, produc-
tion climbed. Finally, in the twelfth and thirteenth periods, the rest pauses
and privileges were removed, and production per hour still remained as high
as under the "best" working conditions. Another striking change was that ab-
senteeism dropped from 15.2 days per year per worker before entering the
study to 3.5 days per year in the test room. The heightened morale of the
workers (a result of being singled out for study, of being better acquainted
with their supervisors, and of feeling personal responsibility for their rate of
output) changed their performance so that it was no longer comparable to
that in the regular workroom.

Distortion is less when the judging is a regular part of the work procedure.
Ratings by foremen can reasonably be regarded as reports of typical behav-
ior in the plant, for the foreman is usually present; how the man would act if
no foreman were provided is not of interest. Wherever the rater or observer
is a regular member of the group, his presence will have little distorting ef-
fect.

An ideally random sample of the subject's total behavior can never be ob-
served. Those moments of his life which are open to the psychologist's in-
spection are by no means typical. It is a fantasy to think of assessing generos-
ity by tabulating the businessman's responses to appeals; these private
moments are not open to observation. Observation in representative situa-
tions can be used only to learn about the individual's typical public behavior:
in classrooms, on playgrounds, and in certain work situations. Indeed, direct
observation of samples of "natural" behavior is restricted almost entirely to
research, particularly research on children who are young enough to ignore
the presence of the observer.


1. Define the range, in time and situations, of the behavior which should be studied 
to answer these questions: 


a. How well does this supervisor handle grievances? 

b. Does study of philosophy make an adult more rational in his daily life? 

c. Does viewing a film on nutrition improve housewives' practices in menu plan-
ning?

d. Do graduates of the modern elementary school write legibly? 

e. How anxious is this patient at this point in therapy? 


Reports from Others and from the Subject 


If we are willing to sacrifice the precision and detachment of the scientific
observer, we can obtain useful information from the subject's acquaintances
and coworkers. The rating by a foreman is more nearly a general impression
than a dependable record of typical behavior, but it is nonetheless useful.
Similarly, mothers give information about children, nurses about patients,
and so on.


We shall discuss the interpretation of self-reports at some length below.


Performance Tests 



Obtaining reports from others, or self-reports, avoids some of the difficul-
ties of field observation. Such reports can (in principle) shed light on cor-
ners of the subject's life where the observer may never go, can cover past
behavior which is no longer observable, and can take into account far more
incidents than any observer could record. These benefits are offset, however,
by the distortions which result when subjective impressions are substituted
for precise quantitative records. Performance tests seek to obtain precise and
dependable information, but they do so by giving up the attempt to take a
representative sample. Just as in a measure of aptitude or achievement, the
tester places the individual in a standardized situation to which he must re-
spond. His performance is evaluated either by an objective performance
score or by observation of the way he responds.

A great variety of techniques fall into this category. One might measure
interests, for example, by a current events test covering developments in sci-
ence, engineering, music, public affairs, and so on. Knowledge in the various
fields is to some degree a reflection of relative interest. One might allow the
subject a supposed "rest period" during a battery of other tests and let him
browse in a library; the books which attract his attention might be presumed
to represent his interests. A third approach is to require him to make up sto-
ries about pictures showing people at work in settings such as a hospital op-
erating room. The ideas and feelings he attributes to the characters in the
pictures may indicate his attitudes about various types of work.

There is no accepted classification system for performance tests. Cattell
has proposed that the term objective test be applied to devices like the cur-
rent events test of interests which yield a direct measure of performance un-
modified by any observation or interpretation. This name, however, is not
accepted by other workers. The name situation test has also been applied to
tests of performance in complex, lifelike situations. It was first used for work-
samples of leadership. The candidate for a leadership position was placed in
a standardized situation, given a crew of men, and observed as he directed
them. Another subcategory is the projective technique. A projective tech-
nique gives the subject material with which to work creatively; e.g., the
tester presents an ambiguous stimulus (inkblot, picture, unfinished story,
etc.) and asks the subject what he sees in it or what he thinks will happen
next. These interpretations are regarded as projections of the subject's un-
conscious wishes, attitudes, and conceptions of the world.
The great advantage of the performance test is that it permits fair compari-
son of individuals. A rating of leadership may reflect differences in opportu-
nity rather than differences in readiness to lead, but a performance test
gives each individual in turn the same opportunity to lead. Individual dif-
ferences in use of that opportunity reflect personality. Behavior in this stand-
ardized situation may be far from "typical." At best, we obtain a sample of
response to a very special stimulus, namely, a-leadership-opportunity-when-
being-tested-by-a-psychologist-whose-good-opinion-will-have-certain-conse-
quences. The performance test gives neat data, but interpretation is far more
difficult than interpretation of observations in representative situations or re-
ports from others.


2. How well can the four Procedure: 


à e 
undistorted by perception of thos 
T " " uring 
c. The data provide a summary or estimate of the individual's behavior during
all moments of his life.

d. The results are the same, whether or not the subject wishes to make a good
impression on the psychologist.


THE SELF-DESCRIPTION AS A REPORT OF 
TYPICAL BEHAVIOR 


The simplest view of the self-report is to treat it as a record of typical behav-
ior, which the subject is in a uniquely excellent position to observe. There is
some justification for so interpreting the interest inventory, since the person
seeking guidance wants his interests to be satisfied in the work he chooses.
Even the interest inventory, however, is not proof against distortion as a re-
sult of status aspirations. In other questionnaires, there are many sources of
distortion which prevent accepting the score as a true summary of behavior.

The first difficulty in questionnaire interpretation is that items are some-
what ambiguous. "Do you make friends easily?" seems a straightforward
question, but it is hard to say just what behavior the question refers to, and
what the tester means by easily.
ior, is unable to count up particular incidents. If he could, his report W me! 
statement. But he will recall some cases where he for! 


tation of the question 
and intimate compani 


... subject taking a questionnaire does not ask such fussy questions (though a
scientific observer tabulating ... would have to). The subject answers the
question in terms of his self-concept. If he regards himself as being the type
who makes friends easily (hang niceties of definition!), he says "yes" to it. A
... popular boy may have a different self-concept and respond "no."
A further difficulty arises because most questionnaires ask about responses to
a hypothetical typical situation, instead of asking about response in well-
defined situations. "Do you seek suggestions from others?" is a fairly clear
question, but most people would have to answer, "Sometimes I do, but not
always." This might be further qualified: "I do on difficult problems"; "I do
if someone is around whose ideas are especially good"; "I don't if I'm sup-
posed to make the decision myself." These qualifications would have to be
considered if the subject tried seriously to report typical behavior. Since he
cannot search his memories to determine what percentage of the time he has
sought suggestions, the question will be answered offhand. When one person
defines "yes" to mean "with very few exceptions" and another defines it as
"fairly often; at least in difficult situations," they are answering different
questions and their responses are not comparable. Another example is the
seemingly clear item: "Do you like to operate an adding machine?" Many
people would say that they enjoy this but would be dissatisfied with a job where
they had nothing to do but operate an adding machine. It is impossible to
phrase items to eliminate such problems of interpretation.

Many self-report tests provide a response scale using such words as "al-
ways," "frequently," "seldom," and "never." Simpson (1944) examined how
such ratings might compare with quantitative observations. He asked stu-
dents what percentage frequency of a particular response would correspond
to a statement that this was what they "usually" did. Twenty-five percent of
them applied "usually" only to events occurring at least 90 percent of the
time; another 25 percent said that "usually" meant a frequency below 70
percent. The quantitative interpretation of other words is shown in Table 58.
It is evident that two subjects with identical behavior may choose entirely
different adverbs to describe what they do.


TABLE 58. Range of Meanings Assigned to Words
Commonly Used in Personality Inventories

What Percentage of All Occasions Is Indicated by the Word at Left?

                   Median      Range of Answers of Middle
Word               Answer      50 Percent of Subjects
Usually              85              70-90
Often                78              ...
Frequently           73              ...
Sometimes            20              13-35
Occasionally         20              10-33
Seldom               10               6-18
Rarely                5               3-10

Source: Simpson, 1944.
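
Simpson's point can be made concrete with a small sketch in Python. The two subjects, their cutoffs, and the 80 percent behavior rate are hypothetical and chosen only for illustration; the 90 and 70 percent cutoffs simply echo the figures quoted from Simpson above.

```python
def says_usually(actual_percent, personal_cutoff):
    """Return True if a subject would describe the behavior as 'usually'.

    personal_cutoff is the lowest percentage of occasions to which this
    subject is willing to apply the word (a hypothetical parameter).
    """
    return actual_percent >= personal_cutoff


# Two hypothetical subjects who both perform the act on 80 percent of occasions.
behavior_percent = 80
cutoffs = {"strict subject": 90, "lenient subject": 70}

# Identical behavior, different private meanings of the same adverb.
answers = {name: says_usually(behavior_percent, cutoff)
           for name, cutoff in cutoffs.items()}

for name, answer in answers.items():
    print(name, "checks 'usually':", answer)
```

Under these assumptions the two subjects give opposite answers to the same item although their behavior is the same, so the difference in response reflects only a difference in the meaning assigned to the word.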
Response Styles


The use of fixed response categories such as "Yes" and "Agree" makes
questionnaires particularly subject to individual response styles. This
was first noted by Lorge (1937; see Cronbach, 1946, 1950), who counted
how often people responded "Like" to SVIB items. One subject uses this
word for every activity he would not positively dislike; another applies the
word only to activities to which he is strongly attached. Such differences
lead to quite different Strong profiles.

Where response style has considerable effect it becomes difficult or impos-
sible to interpret self-reports as if their face content were true (Jackson and
Messick, 1958). The California F scale is a questionnaire developed in re-
search on authoritarian personalities (Adorno, et al., 1950). The items con-
sist of strongly worded opinions most of which express a critical attitude
about human nature. People who endorse these items tend to show other
symptoms of readiness to follow strong, repressive political leadership,
hence the scale is labeled F, for "fascist." It was assumed, in the
years following the publication of this scale, that it gave a dependable report
of the subject's attitudes. Later, it was noted that the items in
the scale were worded in hostile language and that agreement was
consistently scored as undesirable. This raised the question whether
the score reflected the content of attitudes or a mere tendency to agree.
To investigate this, a "reflected" scale was prepared: an al-
ternate version was written in which each item had the opposite ostensible
meaning. We may label this scale F'.


Then a pair of items might be:

(F) Obedience and respect for authority are the most important virtues children
should learn.

(F') Self-reliance and lack of need to submit to authority are the most important
virtues children should learn.


In one study the number of statements marked in the authoritarian direction
on the two scales ("Yes" on F, "No" on F') correlated only .20. The correla-
tion would be .50 or beyond if responses were determined primarily by the
content of attitudes (Bass, 1955; Messick and Jackson, 1957; Ancona, 1954;
Chapman and Campbell, 1957).
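
The comparison of F with its reflected form can be sketched in a few lines of Python. The four respondents below are invented for illustration (two who answer by content, one who agrees with everything, one who disagrees with everything). The function names are hypothetical, and the Pearson product-moment coefficient is an assumption, since the text does not name the coefficient used in the studies cited.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def authoritarian_count_f(answers):
    """On F, agreeing ("Yes") is the authoritarian response."""
    return sum(1 for a in answers if a == "Yes")


def authoritarian_count_reflected(answers):
    """On the reflected scale F', disagreeing ("No") is the authoritarian response."""
    return sum(1 for a in answers if a == "No")


# Hypothetical respondents: (answers to 4 F items, answers to 4 F' items).
respondents = [
    (["Yes"] * 4, ["No"] * 4),   # answers by content, authoritarian
    (["No"] * 4, ["Yes"] * 4),   # answers by content, not authoritarian
    (["Yes"] * 4, ["Yes"] * 4),  # agrees with whatever is put to him
    (["No"] * 4, ["No"] * 4),    # disagrees with whatever is put to him
]

f_scores = [authoritarian_count_f(f) for f, _ in respondents]
reflected_scores = [authoritarian_count_reflected(r) for _, r in respondents]

print("content-driven respondents only:",
      pearson(f_scores[:2], reflected_scores[:2]))
print("all four respondents:", pearson(f_scores, reflected_scores))
```

In this toy sample the two counts agree perfectly when only the content-driven respondents are considered, and the correlation falls to zero once the yea-sayer and nay-sayer are added. That is the direction of effect which the low observed correlation is taken to indicate.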


Faking 


The tester would like to view his inquiry as a scientific project in which
the subject is willing to contribute valid information. The subject comes to
the test with a quite different purpose. In a clinical test, he may wish to
avoid certain threatening diagnoses. In employment testing, his first con-
cern is to land the job. In vocational guidance, he may be more concerned
with convincing the tester that he should enter a certain occupation than
with learning the truth about his suitability for it. When industrial workers
filled out identical health questionnaires under two conditions, the results
were strikingly different. One questionnaire was turned in to the company
medical department, as a preliminary to a medical examination designed to
improve the worker's health. The other questionnaire was mailed directly to
a research team at a university. The workers listed far more symptoms on the
research questionnaire (which would not help them) than on the other, even
though an honest report to the company physician might bring them medical
help (Streib, unpublished).

As might be expected, the subject most often presents himself in a favor-
able light (Edwards, 1957). On an item (e.g., Do you make friends easily?)
where one response is socially desirable, the great majority of subjects will
give that desirable answer. The subject's tendency to make favorable state-
ments about himself, i.e., to put up a good front, is often referred to as a "fa-
çade" effect. Striving to make a favorable impression can be identified by
counting how often favorable self-descriptions are checked. A high façade
score may occur, of course, because the person is truly superior in behavior
and adjustment; but persons whose behavior approaches the ideal on many
dimensions are so unusual that faking is suspected.

Not all the subjects "fake good." Some deliberately give an unfavorable
picture of themselves. A draftee who believes that a poor score on a personal-
ity questionnaire will get him a discharge may report an astonishing array
of emotional symptoms. In an ordinary clinical test, exaggerating symptoms
may be a gambit to enlist sympathy and attention. The subject may prefer
to have it thought that his difficulties as a student are due to emotional dis-
turbance rather than to be thought stupid or lazy.
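
The façade count described above, a tally of favorable self-descriptions, can be sketched as follows in Python. The item wordings, the key of "favorable" answers, and the two answer sheets are all hypothetical; a published inventory would supply its own key.

```python
def facade_score(responses, favorable_key):
    """Count the items on which the subject chose the favorable alternative."""
    return sum(1 for item, answer in responses.items()
               if favorable_key.get(item) == answer)


# Hypothetical key: the socially desirable answer to each item.
favorable_key = {
    "Do you make friends easily?": "Yes",
    "Do you often lose your temper?": "No",
    "Do you tire quickly?": "No",
    "Do you enjoy your work?": "Yes",
}

# Two hypothetical answer sheets.
candid_sheet = {
    "Do you make friends easily?": "Yes",
    "Do you often lose your temper?": "Yes",
    "Do you tire quickly?": "No",
    "Do you enjoy your work?": "No",
}
flattering_sheet = dict(favorable_key)  # picks the desirable answer every time

print("candid sheet:", facade_score(candid_sheet, favorable_key))
print("flattering sheet:", facade_score(flattering_sheet, favorable_key))
```

A maximum score does not by itself prove faking, as the text notes; it only marks a record that is unusual enough to warrant suspicion.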
o be thought stup s evaluation of psychotherapy is the 


ut on borderline responses he selects unfavorable alternatives. This may be 
take his problems seriously and make 


h awareness of symptoms. 


4. Of “Thanks, Doc. I feel fine.” This may i 
at the sacrifice of time, money, and privacy was not foolish. One important
motivation of faking good, Hathaway suggests, is the client's desire to repay
the therapist by letting him see how much help he has given. It would be un-
grateful indeed for the client to dwell on the symptoms the therapy had left
untouched. On his exit questionnaire the client may be disposed to give him-
self (and his therapist) the benefit of all the borderline decisions. The num-
ber of symptoms is thus below the number reported at intake. The therapy
may have produced genuine improvement, but true improvement is hard to
distinguish from change in test-taking attitude.
Investigations of faking compare scores made under instructions to de-
scribe oneself honestly, with scores made when directed to try for a good
score or a bad score. All these studies demonstrate that faking is possible; we
need cite only two representative findings. Longstaff gave the Strong and
Kuder tests to students with the usual instructions, and then asked them to
try simultaneously to fake a particular pattern: high on certain keys and low
on others. The results, some of which are given in Table 59, indicate that
both tests are fakable. On the whole, it is easier to fake high interests on the
Strong and easier to fake aversion on the Kuder. The several keys are not
equally fakable.


TABLE 59. Percentage of Male Students Able to Fake Strong and Kuder
Scores Successfully

Scores "Faked Upward"
  Percentage reaching A on Strong keys:
      Carpenter 9      Chemist 91       Artist 86        Author 83
  Percentage reaching 75th percentile on Kuder keys:
      Mechanical 32    Scientific 5     Artistic 83      Literary 51
  Difference:
      -23              86               3                ...

Scores "Faked Downward"
  Percentage reaching C on Strong keys:
      Accountant 26    Life insurance sales 20    Personnel manager 37    Office man 54
  Percentage reaching 25th percentile on Kuder keys:
      Computational 30 Persuasive 70              Social service 70       Clerical 13
  Difference:
      -4               -50                        -33                     ...


Wesman (1952) gave the Bernreuter Personality Inventory
with the following instructions: "I want you to pretend that you are applying
for the position of salesman in a large industrial organization. You have been
unemployed for some time, have a family to support, and want very much to
land this position. You are being given this test by the employment manager.
Please mark the answers you would give." The next week, the inven-
tory was filled out "as if you were applying for the position of librarian in a
small town." The scores on the two occasions differed spectacularly, as Fig-
ure 77 shows. Studies such as these prove beyond dispute that personality
tests can be falsified, no matter how constructed. Probably most applicants
give more honest answers than did the students in these experiments, but
the fact remains that the dishonest applicant can probably beat the test.


[Figure 77. Panels: Salesman; Librarian. Scale in each panel: Low Confidence
to Confident.]

FIG. 77. "Self-confidence" scores of the same students when playing the role of applicant for
sales and library positions (Wesman, 1952).


3. Some Strong items are obviously related to certain occupations, and these
are usually given higher weights than the subtler items which have a less direct
relationship. Garry (1953) finds that when subjects attempt to fake interest in a
particular field they answer the "obvious" items correctly. Most of them are un-
able to fake successfully on the items whose correlation with the criterion is
lower and less obvious. Would it be a good idea to base Strong scores entirely
on the latter items in order to thwart fakers?


OVERCOMING DISTORTION IN SELF-REPORT 
Establishment of a Coöperative Relationship


In any interview or personality test, the psychologist appeals for coöpera-
tion and employs his skill as best he can to produce rapport. But rapport is a
complex interpersonal relationship, depending on many factors other than
the tester's technique. Never may the tester safely assume that he has estab-
lished the ideal relationship which will cause the subject to want to tell "the
whole truth." The tester's choice of questions and his subtle modification of
the testing situation may cause the subject to shift back and forth from con-
cealing to confessing, but there is little chance that he will come to rest at
"objectivity." Moreover, as psychoanalytic interviews make clear, the sub-
ject's memory of his own life is so distorted by his emotional needs that
long therapeutic sessions are required before he can bring to consciousness
some of the important facts about himself.


Constraint upon Responses 


Variation in response style is reduced or eliminated by forcing all subjects
to respond to the same issue. The most common technique is the "forced-
choice" item seen in the Kuder inventory. Instead of using the am-
biguous categories L, I, and D, Kuder asks which of three activities the sub-
ject likes best. Now the subject cannot say he likes everything, or withhold
information by checking "Indifferent." The forced choice demands informa-
tion regarding specific attitudes, traits, and interests.
Forced choice is especially useful as a means of reducing façade effects.
The popularity ("social desirability") of each statement can be determined
in a preliminary study. The test constructor then forms sets of statements all
having equal desirability. This increases the amount of information obtained,
as can be seen from the following example. Three interest items might be:

                                  Percent of Subjects Saying "Like"
    Watching Western movies                     90
    Driving in the country                      90
    Bird-watching                               10

Administered separately, these would be inefficient for detecting individual
differences, since 90 percent of the subjects give the same answer. If movies
were paired with bird-watching in a preference item, results would be little
better; at least 80 percent of the subjects would prefer movies. If the movie
item is paired with the equally popular driving item, the forced choice will
divide the subjects into nearly equal groups, thus obtaining a maximum
amount of information about differences among them.
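
The arithmetic of the example can be sketched in Python. The function names are hypothetical. The lower bound rests on an assumption the text leaves implicit: a subject who likes one activity and does not like the other will choose the one he likes when the two are paired.

```python
def guaranteed_percent_choosing(percent_like_a, percent_like_b):
    """Smallest possible percentage who like A but not B, and so choose A.

    At most percent_like_b of the subjects can like B, so at least
    percent_like_a - percent_like_b must like A without liking B.
    """
    return max(0, percent_like_a - percent_like_b)


def best_partner(item, popularity):
    """Pick the other item whose popularity is closest to that of `item`."""
    others = [name for name in popularity if name != item]
    return min(others, key=lambda name: abs(popularity[name] - popularity[item]))


# Percent of subjects saying "Like", from the example above.
popularity = {
    "Watching Western movies": 90,
    "Driving in the country": 90,
    "Bird-watching": 10,
}

movies = "Watching Western movies"
print("movies vs. bird-watching, at least this percent choose movies:",
      guaranteed_percent_choosing(popularity[movies], popularity["Bird-watching"]))
print("best partner for the movie item:", best_partner(movies, popularity))
```

Pairing by closest popularity is only one simple way of forming sets of equal desirability. When the two popularities are equal the bound drops to zero, so nothing forces a lopsided split and the choice can separate the subjects.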


wants to “fake good” is outwitted by the force h go 
between equally desi required to describe whit o th 
t characteristic of him, and which faults he suffers ses 
Navran and Stauffacher (1954) asked student nU Each 
in order of their social desirability: The 
most to least characteristic of herse" 


: dicaté 
ankings for any nurse would indi 


d choic? 


greatest degree, 
rank fifteen need 
nurse also ranke 
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tendency to give a favorable picture of her own needs; this correlation was 
higher than .44 for the majority of nurses. A quite different result was found 
on the Edwards forced-choice inventory. In not one case did the rank order 
of Edwards scores correlate as high as .44 with the desirability ranking.
Another study employed a direct measure of façade or self-favorableness. This
score correlates .50 to .90 with Taylor's anxiety, Guilford's coöperation, and
similar scales which score Yes-No answers. The highest correlation among
the fifteen scales of the Edwards forced-choice instrument is .82 for the
Endurance score (Edwards, 1957; R. E. Silverman, 1957). Although the Edwards
scale forces the subject to give a profile with some low points, it is by no
means proof against faking. A subject who can guess that certain good
qualities are more important than others in decisions the tester will make can
distort his responses to earn high scores in those qualities and low scores in
qualities which seem unimportant on this occasion.
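The comparison of two rankings described above can be illustrated with a rank-order correlation. The text does not say which coefficient was used; the sketch below assumes Spearman's formula for rankings without ties, and the rankings themselves are invented for illustration.

```python
def rank_correlation(rank_a, rank_b):
    """Spearman rank-order correlation for two rankings without ties."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have the same length, at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical data: fifteen needs, listed in order of social desirability.
desirability = list(range(1, 16))

# Hypothetical self-ranking that follows the desirability order closely
# (neighboring needs are swapped, the last need keeps its place).
self_ranking = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 15]

rho = rank_correlation(desirability, self_ranking)
print(f"rank correlation = {rho:.3f}; exceeds .44: {rho > 0.44}")
```

A subject whose self-ranking tracks the desirability order this closely would fall well above the .44 level mentioned in the text; a self-ranking unrelated to desirability would give a value near zero.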

The forced-choice method has its defects. It requires more time to obtain
an equal number of responses. It is sometimes resisted by subjects who object
to its "Have you stopped beating your wife?" character. And it may reduce
the validity with which the test predicts external criteria, for reasons
discussed below.


4. What sort of personality would you try to show if "faking good"
a. in a test to select Boy Scout leaders?
b. in a test to select scientists for advanced training?
c. in a test for psychiatric ward attendants?


Empirical Validity of Forced-Choice Instruments. In some inventories,
eliminating response sets eliminates the significant, criterion-related
information from the scores. As mentioned earlier, the F scale measuring
fascist tendencies seems to be largely influenced by acquiescence. This
seemingly irrelevant influence may be desirable. Tendency to accept extreme
statements may in itself be a symptom of an authoritarian outlook; if so, a
forced-choice scale which ruled out differences in acquiescence would
eliminate important evidence (Christie et al., 1958; Gage et al., 1957).

Our present knowledge about acquiescence, facade, and other response
styles may be summarized as follows (Cronbach, 1950):

* General response styles obscure descriptive information. A person who
says that he likes nearly everything tells us little about his particular
interest patterns.

* Response sets can be modified by changing the directions. They are
to some extent transient, and irrelevant to the intended measurement.

* Response styles often correlate with practical criteria, but the
correlations are not high.

* The response style involves three types of variation: transient and
unreliable attitudes, stable patterns which reflect criterion-relevant aspects
of the personality, and stable but psychologically unimportant verbal habits.
The forced-choice technique eliminates all three types of variation so that
scores depend only on reactions to the content of items.

* Reliability is decreased by shifting to the forced-choice form, because
choice is made difficult. The forced-choice instrument is ordinarily a purer
measure of the criterion-relevant qualities in the test, because irrelevant
verbal response habits are eliminated. Changing to the forced-choice form may
or may not raise predictive validity because the loss in reliability may offset
the gain in relevance.

The foregoing statements seem to vacillate between regarding response
styles as beneficial and regarding them as an interference. The statements
are reconciled when we recognize that the effect of response styles depends
on test length. According to the argument developed on page 130, the effect
of length depends on the purity of a test. A short impure test capitalizes
upon the limited predictive validity of response styles and ordinarily gives
a higher correlation with a criterion than does a short forced-choice test.
When the number of items is very large, the purer forced-choice test is more
valid (Osburn, Lubin, Loeffler, and Tye, 1954).

When the subject is motivated to give a favorable report on himself, even
a short forced-choice questionnaire is likely to be advantageous. In an
opinion or interest questionnaire, faking is not generally a serious problem.
Since "every man has a right to his own opinion," no one set of answers is
usually considered especially desirable. Unless the subject has a reason to
conceal his views, he will answer frankly. "The resp[...]ose of the test and
the psychologist's understan[...] respondent to read the psychologist's [...]
the topics would surprise him" (Campbell, 1[...]).


Concealing the Purpose of the Test 


Some tests of [...]ment. More co[...]
It is difficult for the subject to fake when he does not know what the
tester is looking for [...] suspicious and defensive in his responses.
An effective method of concealment is to state a plausible purpose which
is not the tester's real center of interest in giving the test. The F scale,
for instance, is on the surface an inventory of opinions, but it is used to
draw conclusions about the underlying personality. If the content of a test
is such that subjects regard it as a measure of some ability, faking can be
reduced or even eliminated. Campbell (1957) discusses use of measures of
knowledge or reasoning ability as disguised measures of attitude. Another
type of disguise uses questions having one ostensible content but employs a
scoring method which has little or nothing to do with that content. One
investigator asked boys to check which books they had read, seemingly to
measure reading interests. Actually, he had inserted fictitious titles in the
list, and the number of such titles checked was taken as one indicator of
deceit or boasting.

While disguising one's purpose may be effective, it skirts the edge of
unethical practice. And, as one writer has commented, to try to prevent
deceptive subject behavior by becoming deceptive oneself merely encourages the
view that psychologists are tricky, and in the long run may drive subjects to
even greater degrees of evasiveness.


Verification and Correction Keys 


The Kuder interest inventory has a special verification score, obtained by
counting the subject's responses to certain items which are rarely chosen. A
subject who made a large number of these rare responses probably answered
the items without proper concentration. This by no means detects all types
of distortion, but it is of value in group testing. Some subjects are too little
motivated to make the many preference judgments seriously, and there are
even some who lose interest and simply mark at random from that point to
the end of the test.
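A verification count of this kind is simple to express. The sketch below is a minimal illustration, not the Kuder's actual key: the item numbers, the rare options, and the cutoff are all hypothetical, chosen only to show the counting step.

```python
def verification_score(responses, rare_key):
    """Count the responses that match a rarely chosen option."""
    return sum(1 for item, choice in responses.items()
               if rare_key.get(item) == choice)


def is_suspect(responses, rare_key, cutoff):
    """Flag a record whose count of rare responses reaches the cutoff."""
    return verification_score(responses, rare_key) >= cutoff


# Hypothetical key: item number -> option that few subjects choose.
RARE_KEY = {3: "b", 7: "a", 12: "c", 18: "a", 25: "b"}
CUTOFF = 3  # hypothetical threshold for a suspect record

careful_record = {3: "a", 7: "b", 12: "a", 18: "b", 25: "a"}
careless_record = {3: "b", 7: "a", 12: "c", 18: "b", 25: "b"}

for name, record in [("careful", careful_record), ("careless", careless_record)]:
    print(name, verification_score(record, RARE_KEY),
          is_suspect(record, RARE_KEY, CUTOFF))
```

As the text notes, a count like this catches careless or random marking but says nothing about a subject who distorts his answers thoughtfully.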


The Edwards inventory uses 210 forced-choice pairs of statements. Fifteen
pairs are repeated at random intervals within the test. A consistency score
shows how often the respondent gave the same answer on the two occasions.
Other tests use check scores, including façade or social desirability scores
and counts of response styles (e.g., a count of evasive "Cannot say" responses
in the Minnesota Multiphasic).

The check score may be used most simply to eliminate suspect records. It
is also possible to apply statistical corrections which estimate the score that
would have been obtained with a normal response style.


5. If a high-school senior earns a suspiciously high verification score on the
Kuder, what should the counselor do?


ALTERNATIVE INTERPRETATIONS OF RESPONSE CONTENT


No matter what special procedures are used to reduce distortion, the test
responses depend upon how much of the truth the subject is willing and able
to report. Interpretation must take this fact into account. 


Interpretation as True Self-Description 


Simplest but most hazardous is to interpret the responses as a frank report
of the subject's typical behavior. If the relationship between tester and
subject is such that this is a reasonable expectation, then no subtleties of
test design are required.

Complete frankness cannot be anticipated in any situation where the subject
will be rewarded or punished for his response. Some degree of reward
and punishment is implicit in any institutional use of tests, such as clinical
diagnosis or employee selection. Honest self-examination can be hoped for only
when the tester is helping the subject to solve his own problems, and even
then the subject may have a goal for which he wishes the support of the
counselor's authority, which biases his response.


Interpretation as “Published” Self-Concept 


It is more reasonable to interpret the report as a statement of the subject's
public self-concept than as a statement of his typical behavior or of his
private self-concept. To be sure, his public self-concept should correspond in
some measure to his behavior, but the ambiguity of test items and the
inevitable distortion in self-observation reduce this correspondence.

A historian, examining a diary written by a long-dead statesman, refuses to
assume that the statements made therein are true reports of the man's
beliefs and feelings. Unless there is considerable evidence that the document
was a private one never intended for the light of day, the safest assumption
is that the statements represent the image the man wished to leave in
history. The psychologist likewise can regard the responses of his subject as a
"published" self-concept, a statement of the reputation the subject would
like to have.

Sometimes this information may be highly significant. The fact that
an individual is unable to admit certain kinds of tabooed impulses may mean
that the person who presents so perfect a facade on the test maintains a
similar facade in all his social relations. The façade of perfect control and
freedom from impulse is a brittle one and can be maintained only at
considerable emotional cost. Hence the facade itself has diagnostic and
prognostic significance.

The person who admits to certain emotional problems may also be building
up a public image. These may not be the most important problems of
which he is conscious. It is commonly observed in psychotherapy that people
do not bring out their main problems until several interviews have passed.
When a person admits to problems which call for counseling, his report is an
invitation to open counseling with an examination of the area mentioned.
He is saying, first of all, that he is willing to be counseled; second, that this
area is one which concerns him but is not too sensitive to be discussed. His
most serious conflicts may be completely concealed by his questionnaire
responses, but if he is unwilling to admit these conflicts he is probably also
unwilling to deal with them immediately in psychotherapy.


6. A questionnaire is filled out by all parents belonging to a study group, as a
means of identifying problems to be taken up in group discussion. Mr. Smith
checks many problems having to do with developing the child's honesty, respect
for the property of others, and care for his own property. The school counselor
knows, however, that his son has been in difficulty several times because of
aggressive fighting on the playground, window breaking, and other aggressive
offenses which have been called to Mr. Smith's attention. Can the counselor
draw any useful conclusion from Mr. Smith's self-report?

7. An attitude test for foremen presents hypothetical problems that might arise
on the job and asks the subject to indicate what action he would take if he were
foreman. Scores are based on response patterns (e.g., "takes quick action,"
"seeks facts," "emphasizes morale," "emphasizes cost-cutting"). What use can
be made of the responses, in view of the obvious temptation to give a desirable
picture?


Dynamic Interpretation


The clinical psychologist is unwilling to reduce personality to a statistical
report of overt behavior. The clinician is concerned not with the number of
times a person becomes emotionally upset but with the conditions under
which this happens and the forces, internal and external, that lead to it. An
individual who now becomes upset once per month might become chronically
disturbed if conditions changed in a certain manner. In this event, a
percentage average of his past behavior would have almost no predictive
value. A "dynamic" picture of an individual is a picture of the forces changing
his response as situations change. Important in such a picture are his
perceptions of the people he deals with, his feelings about himself, and the
needs which he is trying to satisfy. If the clinician has insight into these
hidden characteristics, he then has some hope of predicting reaction to
particular opportunities or stresses.

Drawing conclusions about personality dynamics even from an extended
series of interviews is difficult, and a brief test or series of tests can only
offer hypotheses of questionable validity. Tests are commonly used as a basis
for dynamic interpretation, however, and highly experienced and perceptive
interpreters can use them profitably. The basic assumption of any dynamic
interpretation is that every act of behavior is meaningful, even when it is
inconsistent with other observations. The task of the interpreter is to seek
some underlying unity which resolves the contradiction.

Dynamic interpretation requires extensive data about the individual's
environment and difficulties as well as his test score, and considerable
knowledge of personality theory. It can be demonstrated in a brief example,
from an analysis by a University of California counselor. Barbara Kirk (1952)
describes the pattern often found among academic failures or near-failures who
do well on aptitude tests given at the Counseling Center. (Our summary is
drastically condensed.)

The explanation and the excuses for the academic deficiency are unrealistic,
superficial, and largely implausible. The counselee demonstrates no real
recognition or admission of the reasons for this deficiency. [...]sions
regarding the Minnesota Multiphasic [...] cases are:

Most frequent is "psychoneurosis with compulsive and depressive di[...]"
Such [persons] tend to be pervasively resistant on an unconscious level to any
externally imposed task. Since childhood, however, they have concealed these
resistances from themselves and others by a facade of hard-workingness,
meticulousness, and earnest dutifulness. In the unstructured environment of a
university the loss of the continued external pushing of teachers and parents
permits the overthrow of the process of grudging achievement, and the
resistances then manifest themselves in nonperformance.

The academic failure probably has meaning in terms of unconscious satisfaction
of the hostility usually directed towards some member of the family who
deman[...] [...]ent scores on tests taken in a counseling situation m[...]
[...]cause no importance is attached to [...] tests, the counselee is free to do
with them as he wishes. It is a declaration, perhaps, of the lack of
significance of his academic failure.


It can be seen that such interpretations involve considerable speculation,
but they alert the counselor to conflicts that may emerge during counseling.


INTERPRETATION OF RESPONSES AS DIAGNOSTIC SIGNS


The risky assumption that the subject is telling the truth can be avoided if
we interpret his response, not as self-description, but as an act of verbal
behavior that is correlated with his inner nature. These two approaches have
been characterized by Florence Goodenough as the "sample" and "sign"
approaches. In regarding responses as samples of behavior, we use transparent
items and pay primary attention to the content of the responses. We are at
the mercy of the subject who wants to mislead us. If we plan to regard
responses as signs, we can use items whose surface content is irrelevant to what
we wish to measure, and even distorted responses may have diagnostic value.

The Strong blank, as originally employed, is based on the "sign" principle.
Strong did not include activities in his Engineer key because they are part of
an engineer's work; he asked only whether the response was characteristic
of engineers. By counting dozens of such "signs," he distinguishes men who
resemble the typical engineer from those who have little in common with
engineers. The interpretation that a person belongs in a category is made on a
strictly actuarial basis. Strong can say, "Persons with this combination of
responses tend to become engineers" in the same way that an insurance
examiner might say, "People with this combination of weight, blood pressure,
and heart condition rarely live beyond 70." In strict actuarial interpretation,
the tester makes no pretense of a rational connection between a particular
response and the criterion. Engineers have greater than average liking for The
National Geographic Magazine; prediction can be based on this fact whether
or not any psychological significance can be attached to it.
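The "sign" principle can be sketched as a small keying routine. This is a simplified illustration under stated assumptions, not Strong's actual weighting procedure: the endorsement rates are invented, the unit weights and the 0.15 threshold are hypothetical, and the function names are introduced only for this example.

```python
def build_empirical_key(criterion_rates, general_rates, min_difference=0.15):
    """Key every item whose "Like" rate separates the two groups.

    An item gets weight +1 if the criterion group likes it more often than
    people in general, -1 if less often. Item content is never consulted.
    """
    key = {}
    for item, criterion_rate in criterion_rates.items():
        difference = criterion_rate - general_rates[item]
        if abs(difference) >= min_difference:
            key[item] = 1 if difference > 0 else -1
    return key


def key_score(liked_items, key):
    """Sum the weights of the keyed items the subject marked "Like"."""
    return sum(weight for item, weight in key.items() if item in liked_items)


# Hypothetical proportions answering "Like" in each group.
engineer_rates = {"The National Geographic Magazine": 0.70,
                  "Mathematics": 0.85, "Poetry": 0.20, "Western movies": 0.90}
general_rates = {"The National Geographic Magazine": 0.45,
                 "Mathematics": 0.40, "Poetry": 0.45, "Western movies": 0.90}

engineer_key = build_empirical_key(engineer_rates, general_rates)

subject_a = {"The National Geographic Magazine", "Mathematics", "Western movies"}
subject_b = {"Poetry", "Western movies"}
print(engineer_key)
print(key_score(subject_a, engineer_key), key_score(subject_b, engineer_key))
```

The item on Western movies drops out of the key because both groups like it equally, while the magazine item is keyed even though it has no obvious connection with engineering. That is the actuarial point: only the observed difference between groups decides what is scored.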

The actuarial approach eliminates the assumption of honest self-report.
The question "Is your health better or poorer than average for your age?"
does not obtain valid facts about health. One person overrates his health in
reporting, another who has only minor ills exaggerates them. If clinically
diagnosed neurotics reply "poorer" more often than do normals, this answer
may be diagnostic even when it is "untrue"; in fact, it may be diagnostic
just because it is untrue. Empirical scales take the "attitude that the verbal
type of personality inventory is not most fruitfully seen as a 'self-rating' or
self-description whose value requires the assumption of accuracy on the part
of the testee in his observations of self. Rather is the response to a test item
taken as an intrinsically interesting segment of verbal behavior, knowledge
regarding which may be of more value than any knowledge of the 'factual'
material about which the item superficially purports to inquire. Thus if a
hypochondriac says that he has 'many headaches' the fact of interest is that he
says this" (Meehl, 1945, p. 9).

An empirically scored test can be used for purposes the subject never
suspects. The Strong ostensibly assesses vocational interests, but one scoring
key combines those items which men answer differently from women into a
"masculinity-femininity" score. It is presumably possible to distinguish
communists from noncommunists, or girls who are likely to marry and stop working
from those who are likely to remain in an occupation. The inventory could
likewise be keyed to distinguish juvenile delinquents from nondelinquents,
or potential suicides from nonsuicides. The basic principle is that any group
which differs on one psychological quality differs on other qualities. Reports
on some of these qualities are likely to be falsified to gain social approval.
The remaining qualities, which carry no connotation of approval or
disapproval, permit valid indirect measurement. Actuarial scoring is by no
means certain to eliminate distortions, as the faking studies on the Strong
blank show. Basing the scoring weights on empirical connections makes it more
difficult for the subject to guess what significance will be attached to his
statement that, for example, he likes to read the Geographic.

Within an empirically scored test, one can distinguish between items that
are "obviously" related to a key and those whose connection is indirect. For
example, in taking the Strong blank a boy trying to fake a high score as
engineer could be expected to indicate a marked liking for mathematics and
technical subjects; these are "obvious items." He would be less likely to
realize that interest in The National Geographic is characteristic of engineers;
this item may be classified as "subtle." If we were to make up two scoring
keys, one for obvious and one for subtle items, we would perhaps be able to
make much more valid distinctions. A person who has a high score on both
keys is more surely like an engineer than one who is high only on the obvious
ones.

This suggestion w[...] Wiener (1948; see [...] keys were carried out. There is
al[...] keys differ in their susceptibility [...] sets (Table 60).


TABLE 60. Correlations of MMPI Scores with Measures of Facade Effect

                            Correlations with Facade Score for
Scale                       Obvious Items      Subtle Items
Depression                      -.78               .33
Psychopathic deviate            -.85               .27
Paranoia                        -.72               .06
Manic                           -.53               .40
Hysteria                        -.71               .54

Source: Fordyce and Rozynko [...]; see also Edwards, 1957, p. 47; [...], 1957;
Hanley, 1957.


The usefulness of empirical tests depends primarily on the adequacy of
the validation experiments. Sometimes absurd weights are assigned,
suggesting that the original validation was based on an inadequate sample.
Allport (1937, p. 329) protests against a scale in which the word association
"green" to the stimulus "grass" is scored +6 as a sign of "loyalty to the gang."
Loyal boys might have given this response more often in the sample tested
than disloyal boys, but it is implausible that the same result would be found
in further studies. Large and representative samples are crucial for
establishing empirical keys. At best the validity of empirical keys is only
moderate, since they rely on indirect information. Because they use subtle
items they are less readily explained to clients than are tests using content
interpretation.


8. Distinguish in each of these cases whether the investigator is assuming that
self-reports are truthful:
a. The clinical symptoms of condition x are determined by observation. A list
of symptoms (swollen feet, rash, etc.) is prepared. This list is used to
determine how frequent condition x is in several localities. Each subject is
asked to check whatever symptoms he has.
b. There are three general stages in social development in which a child names
as favorites (1) other children without regard to sex, (2) persons of his own
sex exclusively, (3) persons of the opposite sex. The investigators ask a child
to name his favorite playmates as a means of determining his level of
development.
c. A psychologist administers to a group of applicants a checklist in which each
marks the adjectives that describe him. The success of these men is observed,
and a record is made of the characteristics checked by the successful
applicants but not by the others. This checklist is then given to further
applicants, and those who check the same characteristics as the previously
successful men are hired.

9. What use could be made of a scale predicting what girls are likely to marry?


ETHICAL ISSUES IN PERSONALITY TESTING


Personality testing has flourished in two contexts, one institutional, the other
individual. Valid information about personality would presumably be of
great value to employers, college admissions officers, and others who make
decisions to carry out institutional policies. In fact, personality tests were
first applied to screen potentially neurotic soldiers. Such institutional testing
tries to determine the truth about the individual, whether he wants that truth
known or not. In noninstitutional testing, tests are applied for the benefit of
the person tested. Here also the tester believes that learning the truth will be
valuable but does not feel free to violate the person's wishes. The client who
comes with an emotional difficulty wants the psychologist's assistance, but he
may be quite unprepared to pay the price of unveiling his soul.

Any test is an invasion of privacy for the subject who does not wish to reveal
himself to the psychologist. While this problem may be encountered in
testing knowledge and intelligence of persons who have left school, the
personality test is much more often regarded as a violation of the subject's
rights. Every man has two personalities: the role he plays in his social
interactions and his "true self." In a culture where open expression of emotion
is discouraged and a taboo is placed on aggressive feelings, for example, there
is certain to be some discrepancy between these two personalities. The
personality test obtains its most significant information by probing deeply into
feelings and attitudes which the individual normally conceals. One test purports
to assess whether an adolescent boy resents authority. Another tries to
determine whether a mother really loves her child. A third has a score for
the strength of sexual needs. These, and virtually all measures of personality,
seek information on areas which the subject has every reason to regard as
private, in normal social intercourse. He is willing to admit the psychologist
into these private areas only if he sees the relevance of the questions to the
attainment of his goals in working with the psychologist. The psychologist is
not "invading privacy" where he is freely admitted and where he has a
genuine need for the information obtained.

Some testers are regarded as "espionage agents" in industry (Otis, [...]).
The newspapers have reported one case of a psychologist who developed for
an industrial client an inventory intended to detect applicants with strong
prounion attitudes, so that the client, by rejecting such men, could keep the
union weak in his plant. As the tester finds increasingly valid ways of
detecting what men feel and think, and as tests are increasingly imposed by
schools, employers, and military services, there will be serious danger of
conflict between the demands of the psychologist's employers and the rights of
the person tested.

Responses have to be evaluated in terms of conformity to some ideal. The
employer who used tests to detect union supporters dictated the attitudes
he wanted employees to have. If it is repugnant to find a powerful person
dictating what a citizen may say, it is unthinkable that he should have the
power to punish unuttered thoughts. Yet that is what a subtle measure of
attitudes threatens when used for institutional purposes. Defining certain
score patterns as good necessarily makes the test a force toward conformity
and standardization.

The use of personality tests for selection arouses resistance, as the
prevalence of faking indicates. Calls for open rebellion flare up from time to
time in the public press, a notable example being The Organization Man, a
challenging book of essays by William H. Whyte, one of the editors of Fortune.
He warns men seeking executive positions that they can count on favorable
recommendations from the psychologist who examines them only if they display
a particular pattern: extrovert, uninterested in the arts, and acceptant of the
status quo. He advises them to fake "normal" answers:

Give the most conventional, run-of-the-mill, pedestrian answer
possible. When in doubt about the most beneficial answer to any question,
repeat to yourself:

I loved my father and my mother, but my father a little bit more.
I like things pretty much the way they are.
I never worry much about anything.
I don't care for books or music much.
I love my wife and children.
I don't let them get in the way of company work.


Whyte is certainly incorrect in describing this one interpretation as
representing the practice of all industrial psychologists concerned with
executive selection, but regardless of what the psychologist concludes, the
firm to which he reports is likely to prefer the man who has "safe" attitudes.

The Standards of Ethical Behavior for Psychologists (1958) include the 
following principles: 

* The psychologist in industry, education, and other situations in
which conflicts of interest may arise among varied parties as between
management and labor, defines for himself the nature and direction of
his loyalties and responsibilities and keeps these parties informed of
these commitments.

* [When serving the individual] the psychologist informs his prospective
client of the important aspects of the potential relationship that
might affect the client's decision to enter the relationship.

* The psychologist who asks that an individual reveal personal information
in the course of interviewing, testing, or evaluation, or who
allows such information to be divulged to him, does so only after making
certain that the person is aware of the purpose of the interview,
testing, or evaluation and of the ways in which the information may be
used.


No ethical objection can be raised to the use of subtle techniques and even
misleading instructions when the information so obtained will be used entirely
for research purposes, the subject's identity being concealed in any report.
Even when the tests are intended solely for research, the tester should
not be a person who has other responsibilities toward the subject (e.g., his
teacher or therapist) except under the conditions described below.

Whether serving an institution or serving an individual client, the tester
should not use indirect and misleading techniques unless the subject clearly
realizes that "anything he says may be used against him." To be sure, an
employer may regard his refusal to submit to tests as grounds for denying
him employment, but this is ethically preferable to obtaining deceitfully
information he does not wish to give.
In a clinical setting, the psychologist can likewise offer a choice, with an
introduction of approximately this character (see also pp. 293-296): "It
might help to solve your problem more rapidly if we collect as much infor-
mation as we can. Some of our tests use straightforward questions whose
purpose you will readily understand. Some of our other tests dig more
deeply into the personality. Sometimes they bring to light emotional con-
flicts that the person is not even conscious of. Few of us admit, even to
ourselves, the whole truth about our feelings and ideas. I think I can help you
better with the aid of these tests."


The client may refuse to take disguised tests if he is not ready to trust the
psychologist with full knowledge of his personality. If this is the case, the in-
formation probably could not be used constructively in counseling him. In
counseling it is both advantage and disadvantage that direct, unsubtle tests
are no more than tabulations of statements made by the person about
himself. While they uncover no secrets, they frequently accelerate the counsel-
ing process because they represent things he is ready to discuss with the
counselor.


There remains the question of using personality tests when the tester has
authority over the person tested. The psychologist diagnosing mental patients,
the military psychologist, or the schoolteacher can enforce tests on his [...]


[...] given separately to both [...] their suitability and probable success as
[...] subtle questions?

13. The Minnesota Teacher [...] Inventory [...] have the attitudes that [...]

[...]tive techniques, Krugman's composite of extracts from test reviews and his
statement of emerging trends are of particular interest.

Longstaff, H. P., & Jurgensen, C. E. Fakability of the Jurgensen Classification
Inventory. J. appl. Psychol., 1953, 37, 86-89.

This describes an illustrative experiment on faking of a forced-choice inven- 
tory under different sets of directions. 

McClelland, David C. Roles and role models. Personality. New York: Dryden, 
1957. Pp. 289-339. 

Describing personality in terms of typical behavior is not completely satisfac- 
tory because behavior varies with the situation. McClelland illustrates and 
accounts for such inconsistency in terms of changing social roles. 

Meehl, P. E. The dynamics of "structured" personality tests. J. clin. Psychol., 1945,
1, 296-303. (Reprinted in G. S. Welsh & W. G. Dahlstrom (eds.), Basic read-
ings on the MMPI in psychology and medicine. Minneapolis: University of
Minnesota Press, 1956. Pp. 5-11.)

Meehl argues that self-report is undependable and uninterpretable if taken 
at face value, and defends actuarial keying as the only suitable method of 
obtaining useful insight from questionnaires. 

Whyte, William H., Jr. The tests of conformity. The organization man. New York:
Doubleday Anchor, 1956. Pp. 201-222. 

This is a scathing critique of personality tests as used in executive selection. 
Examine also the Appendix, How to cheat on personality tests. 




Personality Measurement Through 
Self-Report 


HISTORY OF PERSONALITY INVENTORIES 


IN INTRODUCING personality measurement we have spoken of it as an
attempt to assess "typical behavior." This phrase, which has served our
purposes to this point, echoes the viewpoint of behavioristic psychology, which
is concerned primarily with overt, observable responses. The behavioristic
outlook is somewhat limiting, however, and we can understand personality
assessment better if we recognize that its development has been strongly
influenced by the attitudes of phenomenological psychology. Phenomenolog-
ical psychology is concerned with the way the world appears to the individ-
ual, with his so-called private world. Such expressions as self-concept, feel-
ings of hostility, and attitude toward authority refer to perceptions and
reactions occurring within the individual. Many important events such as
hallucinations and dreams exist only in the person's awareness. It can be
argued that almost all crises of adjustment are shaped more by the
individual's perception of events than by the events themselves. As a
consequence, many psychologists are more concerned with the subjective
reactions of the person than with his outward responses.

The first personality questionnaires were developed in an attempt to study
the inner world of perception and feeling. Sir Francis Galton in the 1880's
devised the technique when he needed a standard procedure which could
be applied to numerous subjects for his studies of mental imagery. Use of
questionnaires, again for research purposes, was extended later in the nine-
teenth century by G. Stanley Hall in his vast studies of development. He
used information given by large samples of adults to determine normal
trends in development, being little concerned with single individuals.

The questionnaire served rather different functions for the two men. In
Galton's work, self-report was used as the only possible way to obtain infor-
mation on events within the respondent's head. Hall's self-report was used to
avoid the labor and delay involved in direct observation of behavior.


1. Which of these problems or topics of research falls within behavioristic psy-
chology, and which within phenomenological psychology?
a. How frequently does mood change in the typical woman?
b. How much does speed of reading decline in the presence of loud, con-
tinuous noise?
c. Do managers and workers describe the company policy on selecting workers
for promotion in the same way?
d. What is this child afraid of?


Adjustment Inventories 


The first inventory primarily concerned with assessing the individual was
the Woodworth Personal Data Sheet. The U.S. Army, at the beginning of
World War I, wanted to detect soldiers likely to break down in combat, but
individual psychiatric interviews were not practicable when recruits were
processed by the thousand. Woodworth made a list of symptoms such as
psychiatrists would touch upon in a screening interview and presented the
list as a questionnaire. This pencil-and-paper version of the interview pre-
sented questions such as a psychiatrist would ask: "Do you daydream fre-
quently?" "Do you wet your bed?" etc. It differed from the interview only
in that the sensitivity of individual questioning was sacrificed for speed. Men
who reported numerous symptoms were singled out for further examination.

The test was valued because it had appreciable power to detect maladjusted
soldiers in a situation where individual interviewing of every man was to-
tally out of the question.

The Woodworth scale was a forerunner of a number of "adjustment inven-
tories," which consist primarily of lists of problems, symptoms, or grievances
to be checked. These instruments make little claim to subtle description of
personality, often yielding only a single score representing level of adjust-
ment. Sometimes only one type of symptom is emphasized, as in the Cornell
Medical Index covering psychosomatic complaints. Sometimes the items are
grouped by logical categories, as in the Bell Adjustment Inventory, which
has scores for home, health, social, and emotional adjustment based
(respectively) on items such as:¹

Has either of your parents frequently criticized you unjustly?
Are you subject to eye strain?
Would you feel very self-conscious if you had to volunteer an idea to start a
discussion among a group of people?
Do you get discouraged easily?

¹ Items quoted in this chapter come from various tests and are used by
permission of the copyright holders: Bell Adjustment Inventory, copyright
1934, 1938, 1959, Consulting Psychologists Press; Minnesota Multiphasic
Personality Inventory (MMPI), copyright 1943, University of Minnesota,
published by The Psychological Corporation; Thurstone Temperament Schedule,
copyright 1949, Science Research Associates; Minnesota Personality Scale,
copyright 1941, The Psychological Corporation; Minnesota Counseling In-
ventory, copyright 1953, University of Minnesota, published by The
Psychological Corporation.


An adjustment inventory consists of items that differentiate subjects known 
to be maladjusted from subjects judged normal. 

One principal use of such inventories is to identify those who should be 
offered counseling. While “problem cases" who cause trouble are easily 
recognized, children and adults who are withdrawn and insecure may not 
attract the attention of observers. An adjustment inventory brings to light 
many of these cases. Simple though the inventory may be, it can play a
valuable role in large guidance programs. Some indication of the demand for 
such aids is the fact that one modest inventory reported, after ten years of 
distribution with no special advertising, that half a million copies had been 
sold. 

Adjustment inventories are best regarded as screening instruments which 
single out persons who freely check symptoms and self-criticisms. They are 


not definitive measures of any clearly defined trait; such information as they 
provide is superficial at best. 


Trait Descriptions 


During the period from 1920 to 1945, psychologists were largely behavior-
istic in outlook and unwilling to base conclusions on the individual's intro-
spections. The inventory was thought of as primarily a substitute for ob-
servation of behavior, and the questions placed more emphasis on what the
individual did than upon how he felt or what he thought. The questionnaire
was broadened to describe as many aspects of behavior as possible, and re-
sponses were summarized by giving scores on a number of "traits" or re-
sponse patterns. Personality was conceived during this period as a bundle of
habits. The individual was described by the strength of such traits as friend-
liness, confidence, persistence, etc. A "strong" trait was one describing a
response which he usually or frequently made.

In early inventories this list of traits or behavior categories to be scored
was arbitrarily chosen. Some traits such as self-confidence came from popular
usage, and some such as introversion from personality theory. Dozens of
instruments were produced, each taking items from its predecessors, adding
a few new ones, and scoring them in new combinations. The best-known
instrument of this period was the Bernreuter Personality Inventory, similar
to the Bell Inventory in form but using more varied questions. It is scored
for Neurotic Tendency (i.e., adjustment), Self-Sufficiency, Introversion,
and Dominance.



A study of this scale by Flanagan marked the introduction of trait scores
defined according to statistical rules. Flanagan adopted the principle that to
deserve separate names, traits must have low correlations. He intercorre-
lated the Bernreuter scores of 305 adolescent boys (Table 61) and found


TABLE 61. Intercorrelations of Bernreuter Scores for Adolescent Boys 


                     Neurotic    Self-
                     Tendency    Sufficiency   Introversion   Dominance


—.69 
Neurotic Tendency T E 5 
Self-Sufficiency à —.62 
Introversion 
Dominance 


Source: Flanagan, 1935.


the traits by no means independent. "Introversion," as there measured, is
little different from "Neurotic Tendency," since items on social isolation and
daydreaming carry large weight in both scales. Applying factor analysis,
Flanagan found that Confidence and Sociability scores could account for the
information carried by the four original keys, and he developed scoring keys
for these traits. The scores correlate negligibly and thus do represent inde-
pendent aspects of self-report.
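Flanagan's principle, that scores deserve separate names only if they correlate little, can be checked mechanically once scores are in hand. The following is a minimal sketch in Python, not Flanagan's own procedure: the six sets of scores are invented for illustration, and the cutoff of .60 for calling two scores redundant is an assumption, since no numerical rule is given here.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(scores, cutoff=0.60):
    """List pairs of trait scores whose correlation is at least `cutoff` in size.

    `cutoff` is an illustrative assumption, not a published standard.
    """
    pairs = []
    for a, b in combinations(scores, 2):
        r = pearson_r(scores[a], scores[b])
        if abs(r) >= cutoff:
            pairs.append((a, b, round(r, 2)))
    return pairs


# Invented scores for six subjects (hypothetical data, not Flanagan's).
scores = {
    "Neurotic Tendency": [12, 30, 25, 8, 40, 18],
    "Introversion": [10, 28, 27, 9, 38, 15],
    "Self-Sufficiency": [20, 22, 18, 21, 19, 23],
}

for first, second, r in redundant_pairs(scores):
    print(f"{first} and {second} overlap heavily (r = {r})")
```

With these invented data only the Neurotic Tendency and Introversion scores are flagged, which is the kind of overlap that would argue against giving the two scores separate names.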
There followed a period in which personality theory was wholly subordinated
to a statistical search for "dimensions" which could summarize personality.
Item intercorrelations led Guilford, for example, to suggest that introversion
could be separated into social introversion (S), thinking introversion (T),
depression (D), cycloid tendencies (frequent shifts of mood) (C), and
restraint (R). Accordingly, he developed the Inventory of Factors
S-T-D-C-R. Later he added eight more aspects of personality. The Guilford
scales were not uncorrelated (resembling verbal and numerical reasoning
scores in this respect). Other investigators therefore rearranged them into
scoring patterns which they regarded as more efficient. Thurstone, for in-
stance, accounted for much of the information in Guilford's thirteen scores by
seven factors which he renamed reflective, sociable, emotionally stable,
vigorous or masculine, ascendant or dominant, active, and impulsive. This
game is interminable. One psychologist classifies the items finely, the second
ties some of the small bundles together, the third redivides the large bun-
dles in a new way, and each gives his own names to the factors. Unless trait
names can be tied down to a definite theory or to external criteria, choice can
be made only on aesthetic grounds. There is at present no agreement among
factor analysts as to the number of factors that have been reliably identified,
the best organization of them, or their most appropriate names.
Trait names, we may note in passing, are a source of great confusion in
the personality field. The meaning of "introvert" is twisted and turned so
that it represents for one author a brooding neurotic, for another anyone
who would rather be a clerk than a carnival barker. "Ascendance" ranges
from spontaneous social responsiveness, in one theory, to overbearing
behavior in another. The verbal coinage has been so debased
by popular usage and by questionnaire makers that some investigators have
tried to free themselves by coining completely new terms. R. B. Cattell
has succeeded in popularizing his word surgency to describe a certain pat-
tern of energetic behavior, but he will surely encounter considerable re-
sistance to such new-minted trait names as parmia, premsia, and [...]
(akin respectively to social extroversion, emotional sensitiveness, and [...]
to intellectual culture). In the present Babel of trait names, the only useful
way to discuss personality test data is to speak of "Guilford's Ascendance
score," "CPI Dominance score," or "Thurstone Ascendant score," according
to the measure used.


2. Does the Woodworth inventory employ the "sign" or the "sample" approach?

3. Some testers have treated Flanagan's two scoring keys as supplements to
the original four, reporting all six scores to describe an individual.
Discuss this practice.

[...] that Henry falls at the 50th percentile in Sociability [...] in terms
of a "habit"?


Criterion-Oriented Tests 


Construction of personality questionnaires according to the "sign" prin-
ciple used by Strong has been rare, chiefly because criteria in the personality
[...] point of departure is [...]. Humm developed the Humm-Wadsworth
Temperament Scale [...] such groups as manic and [...] industrial
psychologists given special [...], and little of the research done with the
test was reported.

Essentially the same approach to test construction and many of the same
items were used in the Minnesota Multiphasic Personality Inventory. This
scale, published in 1942, was very rapidly accepted and remains today the
most widely used and most widely investigated of questionnaires. Although
strictly empirical in its original conception, it proved to be relatively inef-
fective in allocating patients to diagnostic groups. The test has, however,
grown in prominence because accumulated research and clinical experience
permit the tester to interpret scores. It will be discussed at length below.


7. In developing an empirical scale for college students one might use a key
consisting of items that distinguish campus leaders from nonleaders. Suggest
other criteria that might mark important personality dimensions.

8. An investigator believes that teachers can be characterized by the trait "con-
tent-centered" vs. "child-centered." Outline the procedure needed to make a
self-report test by means of criterion keying.


Tests Derived from Personality Theory 


Whereas both the factor analysts and the empiricists developed tests by
blind groping to find what-correlates-with-what, the more recent trend in
personality measurement is to define constructs on the basis of personality
theory and to prepare items specifically to elicit information about those
constructs. This is not wholly new; indeed, the earliest work on introversion
was stimulated by Jung's personality theory. But that theory had little in-
fluence on the actual tests, beyond suggesting items for trial. Today, con-
siderable research is going into the Myers-Briggs Inventory, whose items
and scoring keys are explicitly dictated by Jungian theory. Other instruments
which illustrate this trend are the Edwards Schedule (which derives from
the Murray theory of needs), the Taylor Manifest Anxiety Scale (designed
in connection with research on Hull-Spence behavior theory), and the Cali-
fornia F scale for identifying "authoritarian" personalities. The theoretically
oriented instrument often is confined to one single trait. To validate a test as
a measure of even one construct requires extensive and painstaking research,
and it is a brave investigator who tries to advance on more than one theoreti-
cal front at a time.


9. How many scores would you consider necessary to give a complete picture of
personality?

10. The following items are taken from various personality inventories. Is any ap-
parent purpose served by the alterations in form and wording?

Did you ever have a strong desire to run away from home?
Yes No (Bell Adjustment Inventory)
At times I have very much wanted to leave home.
True False Cannot Say (MMPI)

11. Is any apparent purpose served by these changes of form and wording?

Are you at ease in a large group of people?
Yes No (Thurstone Temperament Schedule)
I am a good mixer.
True False (Calif. Personality Inventory)
Do you like to mix with people socially?
Almost always Frequently Occasionally Rarely Almost never (Minn.
Pers. Inventory)


DESCRIPTION OF THE MMPI


The Minnesota Multiphasic Personality Inventory (MMPI) holds a place
among personality questionnaires comparable to that of the Strong among
interest measures. It was constructed in a similar empirical manner and
was subjected to exceptionally thorough research by its authors. It appeared
at an opportune time, and great reliance was placed upon it during the
wartime and postwar expansion of clinical psychology. It contributed to and
benefited from the postwar interest in clinical research, and as a result has
been studied more adequately than any other personality test. There were
[...] titles included in a bibliography covering MMPI research through
19[...]; at that time, the number of MMPI studies was 100 per year and the
rate was still increasing (Welsh and Dahlstrom, 1956).

The MMPI was originally constructed by a psychologist, Starke R. Hath-
away, and a psychiatrist, J. C. McKinley, to aid in diagnosis of clinical pa-
tients. A collection of 550 items was prepared by borrowing from older in-
ventories and rephrasing diagnostic cues used by psychiatrists. Among the
items to be answered "T," "F," or "?" (cannot say) are these:


I believe I am being plotted against.

It takes a lot of argument to convince some people of the truth.
I wish I could be as happy as others seem to be.

I drink an unusually large amount of water every day.


The content of these items is quite diverse. Some report observable behavior,
some report feelings that could not be observed from the outside, and some
express general social attitudes. Some items frankly report symptoms of ab-
normal behavior, whereas others appear to have no favorable or unfavor-
able connotation.


Scoring Procedures 


Psychiatric Discriminant Keys. The scoring keys were developed with the
intention of identifying patients with respect to such recognized psychiatric
states as hysteria. Patients of each type were compared, item by item, with a
so-called normal group drawn from visitors coming to a large city hospital.
Items which distinguished paranoids from normals were counted in the Pa
(paranoid) key. Paranoid patients tend to say "True" to the first of our speci-
men items ("plotted against") and it is included in the Pa key. The second
item ("argument to convince some people") seems to imply a paranoid in-
sistence on one's own ideas, but it does not differentiate paranoid patients
from normals and is not in the Pa key. Instead, the evidence shows that re-
sponding F to this item is indicative of hysteria.

The contrast between the MMPI "sign" approach and the content-oriented
approach of its predecessors is illustrated by the fact that certain items of
the MMPI are also found in the Guilford homogeneous scales but are
scored in the opposite direction.

For example, to say that most people inwardly dislike putting them-
selves out to help others, that most people would tell a lie to get ahead,
. . . are responses scored as paranoid on the Guilford-Martin; whereas
it is found empirically that these verbal reactions are actually signifi-
cantly less common among clinically paranoid persons than they are
among people generally. This kind of finding suggests that paranoid
deviates are characterized by a tendency to give two sorts of responses,
one of which is obviously paranoid, the other "obviously" not. [Meehl
and Hathaway, 1946.]
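The item-by-item comparison of a patient group with normals can be illustrated with a short sketch in Python. It is not the procedure the MMPI authors used: the two-proportion z test and its cutoff of 1.96 are stand-ins chosen for this illustration, the function names are hypothetical, and the answer records are invented. The point is only that an item enters the key, and is scored in a given direction, because of how the groups actually answer and not because of what the item appears to say.

```python
from math import sqrt


def select_key_items(criterion, normals, z_cutoff=1.96):
    """Return {item_index: keyed_answer} for items that separate the groups.

    Each person is a list of booleans (True = answered "True").
    The z test and its cutoff are illustrative assumptions.
    """
    n_c, n_n = len(criterion), len(normals)
    key = {}
    for i in range(len(criterion[0])):
        p_c = sum(person[i] for person in criterion) / n_c
        p_n = sum(person[i] for person in normals) / n_n
        pooled = (p_c * n_c + p_n * n_n) / (n_c + n_n)
        se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_n))
        if se == 0:
            continue  # everyone answered alike; the item cannot discriminate
        z = (p_c - p_n) / se
        if abs(z) >= z_cutoff:
            key[i] = z > 0  # keyed answer is the one more common among patients
    return key


def score(person, key):
    """Count the keyed answers a person gives."""
    return sum(1 for i, keyed in key.items() if person[i] == keyed)


# Invented records: 20 patients and 20 normals answering three items.
patients = [[k < 16, k % 2 == 0, k < 3] for k in range(20)]
normals = [[k < 2, k % 2 == 0, k < 15] for k in range(20)]

key = select_key_items(patients, normals)
print(key)
```

In these invented data the second item is answered alike by both groups and is left out of the key however plausible its content, while the third item is keyed "False" because the patients endorse it less often than normals do.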
The original scales developed by the test authors are as follows:

1. Hs: hypochondriasis
2. D: depression
3. Hy: hysteria
   (scales 1-3 form the so-called "neurotic triad")
4. Pd: psychopathic deviate
5. Mf: masculinity-femininity
6. Pa: paranoia
7. Pt: psychasthenia
8. Sc: schizophrenia
9. Ma: hypomania

Data for the reference group of normals provide a standard-score conversion
so that results can be plotted on a profile sheet as shown in Figure 78. Pri-
mary significance is attached to scores greater than 70 (50 being the average
for the reference group). This cutoff is somewhat arbitrary, and interpreters
examine all peaks whether or not they cross this line.
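The standard-score conversion can be sketched as follows. This is an illustration under stated assumptions, not the published conversion tables: it assumes the conventional T-score scale (mean 50, standard deviation 10 in the reference group), which is consistent with the average of 50 and the cutoff of 70 mentioned above, and the reference raw scores are invented.

```python
from statistics import mean, pstdev


def t_score(raw, reference_raw):
    """Convert a raw score to a standard score with mean 50 and SD 10.

    `reference_raw` holds raw scores of the normal reference group
    (invented here; the real norms are in the test manual).
    """
    return 50 + 10 * (raw - mean(reference_raw)) / pstdev(reference_raw)


def peaks(profile, cutoff=70):
    """Return the scales whose standard scores exceed the cutoff."""
    return [scale for scale, t in profile.items() if t > cutoff]


reference = [8, 10, 12, 14, 16]  # hypothetical raw scores of normals
print(t_score(12, reference))    # a raw score at the reference mean
print(peaks({"Hs": 55, "D": 74, "Hy": 70}))
```

A score of 70 on this scale lies two standard deviations above the mean of the normal group. Because the cutoff is somewhat arbitrary, a list of peaks such as this is a starting point for reading the profile and not a verdict.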

Control Keys. The MMPI is provided with several correction or control
keys intended to identify or make allowance for exceptional response
styles. The simplest group of control keys are known as ?, L, and F.

The ? score is the number of times the person replies "Cannot say." Exces-
sive evasion of questions of course makes it meaningless to compare the sub-
ject's responses with the standardization group. Profiles showing high ?
scores are recognized as invalid.

There are some test items so worded that a person who denies having
the symptoms is almost certainly not evaluating himself frankly. One ex-
ample is "I sometimes put off until tomorrow what I ought to do today."
The L (lie) score is based on a count of such improbable answers. A high
L score indicates that answers are untrustworthy but need not indicate de-
liberate lying. The L key detects some cases of faking good, but it cannot
be depended upon to detect faking by sophisticated subjects.

The F (false) score consists, like the Kuder verification score, of responses
given extremely rarely. A high F count reveals carelessness, misunderstand-
ing, or otherwise invalid answers. The F score tends to be high for subjects
who attempt to fake bad records, because rare responses are usually un-
favorable self-descriptions.
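The way these three control scores qualify a profile can be put in a small sketch. It is illustrative only: the function name is hypothetical, and the use of a standard score of 70 as the point at which a control score counts as "high" is an assumption made for the example, not a rule stated above.

```python
def control_warnings(controls, cutoff=70):
    """Return cautions for any of the ?, L, F scores at or above `cutoff`.

    `controls` maps a key name to its standard score. The cutoff of 70 is
    an illustrative assumption; a missing score is treated as average (50).
    """
    notes = {
        "?": "many items evaded; profile not comparable with the norm group",
        "L": "improbable denials; answers untrustworthy, possible faking good",
        "F": "many rare answers; carelessness, misunderstanding, or faking bad",
    }
    return [f"{name}: {notes[name]}" for name in ("?", "L", "F")
            if controls.get(name, 50) >= cutoff]


for warning in control_warnings({"?": 50, "L": 73, "F": 50}):
    print(warning)
```

A warning of this kind tells the interpreter how far to trust the clinical scales; it does not by itself establish that the subject lied.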


FIG. 78. MMPI record of a male mental patient. (Data from Shneidman, 1951,
p. 221. Profile form copyright 1948, The Psychological Corporation.
Reproduced by permission.)


K, the fourth and most important control key, was designed on an empiri-
cal basis. It was found, early in the test development, that some quite nor-
mal individuals earn scores above 70 in Hs, for example, because of what
have been called "plus-getting" attitudes. That is to say, these persons dis-
play such complete frankness or self-depreciation that their response pat-
terns appear abnormal. Among patients, on the other hand, there are a large
number whose scores remain below 70 because of defensive denial of symp-
toms. In order to reduce the number of such misses and false positives in
MMPI diagnosis, a key was made to measure defensiveness. The investiga-
tors identified items commonly marked by the clinical cases whose profiles
were less deviant than they should have been. The K key composed of those
items expresses a bland "all is well" facade, e.g.:

I have very few quarrels with members of my family. (T) 

Criticism or scolding hurts me terribly. (F) 


Whereas most control scores are used simply to signal untrustworthy pro-
files, the K scale is employed in a regression formula to correct the regular
scales for test-taking attitude. Thus the original Hs scale was replaced by
Hs + .5K. These corrected scales became the main keys for the test about
1946. Although our discussion of validity is to come later, we can give here
one example of the effect of the correction. When 200 normals were com-
pared with 101 hypochondriacal patients, 5 percent of the normals and 62
percent of the patients exceeded an Hs score of 69.8. After correction, a cut-
ting score which picked off 5 percent of the normals could detect 72 percent
of the patients (McKinley et al., 1948). The "misses" were thus reduced
from 38 to 28 percent. In subsequent studies by other authors, the K correc-
tion has not been found consistently valuable.
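The logic of the comparison just reported can be sketched in a few lines: correct each score, fix the cutting score so that about 5 percent of normals exceed it, and see what share of patients is detected. The sketch is illustrative only. The weight of one-half on K is the conventional value for Hs; the function names and all the scores are invented, so the rates it prints are not those of McKinley et al.

```python
def k_corrected(raw, k_raw, weight=0.5):
    """Add a fraction of the K score to a clinical raw score (Hs + .5K by default)."""
    return raw + weight * k_raw


def cutting_score(normal_scores, allowed_share=0.05):
    """Pick a cut such that no more than `allowed_share` of normals score above it."""
    ordered = sorted(normal_scores)
    allowed = int(allowed_share * len(ordered))  # normals permitted above the cut
    return ordered[len(ordered) - allowed - 1]


def detection_rate(patient_scores, cut):
    """Share of patients scoring above the cut; the remainder are 'misses'."""
    return sum(s > cut for s in patient_scores) / len(patient_scores)


normals = list(range(1, 101))                             # invented scores
patients = [80, 90, 96, 97, 98, 99, 100, 101, 102, 103]   # invented scores

cut = cutting_score(normals)
hits = detection_rate(patients, cut)
misses = 1 - hits
print(cut, hits, round(misses, 2))
```

Applied to the figures quoted above, the same arithmetic gives misses of 100 - 62 = 38 percent before correction and 100 - 72 = 28 percent after, with the false-positive rate among normals held at 5 percent in both cases.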

Still a further method of identifying faking is to score separately the ob-
vious and the subtle items in any key (see pp. 458 ff.).


12. A client coming to a social agency has these MMPI scores:

Hs  D   Hy  Pd  Mf  Pa  Pt  Sc    Ma
43  45  50  50  50  68  42  6[?]  6[?]

How would the interpretation be affected if the "control scores" were as
follows:
a. ?, 72; L, 50; K, 50; F, 50.
b. ?, 50; L, 73; K, 50; F, 50.
c. ?, 50; L, 50; K, 72; F, 50.
d. ?, 50; L, 50; K, 35; F, 50.

Descriptive Interpretation of Coded Profiles. Although MMPI scores derived
from psychiatric diagnosis, the diagnostic categories per se play little part in
the interpretation. At some point in the late 1940's, as Dr. McKinley reached
the point of retirement and Paul E. Meehl became more actively identified
with research on the scale, a new viewpoint began to replace the original
diagnostic emphasis. By 1951 Meehl was ready to say (Meehl, 1951):


These days we are tending to start with the test, sort people on the
basis of it, and then take a good look at the people to see what kind of
people they are. This, of course, is different from the way in which the
test was built, and different from the usual psychiatrist's notion of a test
where you start with groups of people sorted out on some basis (for in-
stance, by formal psychiatric diagnosis) and you try to build a test
which will guess or predict or agree with that . . . criterion. . . . The
idea that the primary function of psychometrics is to permit me . . . to
prophesy what the psychiatrist is going to say about somebody is . . .
not a very powerful way of looking at . . . the Multiphasic.


For Meehl's purpose, any set of reasonably uncorrelated scales would pre- 
sumably have been at least as appropriate as the psychiatrically oriented 
keys of MMPI. By the time this new viewpoint emerged, so much experi- 
ence had been accumulated on the psychiatric scales that they could be re- 
placed only with great loss. The subsequent work on the test has been an 
attempt to work out meaningful interpretations for obsolete scales. 

In arriving at a description of personality, it is customary to consider the
salient features of the MMPI profile simultaneously. The high and low
points are listed in terms of the code numbers given above. For example, a
32-6 profile is one in which scales 2 and 3 are exceptionally high with 3 being
highest, and 6 is exceptionally low. (This individual, that is, has high counts
in depressive and hysteric response categories, and is very much unlike the
paranoid.) Some clinicians use extremely elaborate codes, but two- or three-
digit codes are sufficient; if more detail is required, the original profile is
more satisfactory than the code.

When they introduce a numerical code, the MMPI developers attempt to
sidestep some of the consequences of having started originally from psy-
chiatric diagnoses. The counselor should never tell a client that he has "a
high schizophrenic score." Such labels confuse even trained psychologists
when the test is applied outside the mental hospital. Thus, although the test
record forms still carry the labels Pd, Sc, etc., Meehl (1951) advises,

If you can, get into the habit of using the code to talk about curves,
instead of talking about the psychiatric category names. . . . It's worst
to talk about the schizophrenia key; it's better to talk about the Sc key;
it's best to talk about code 8. That is, of course, entirely in line with
. . . starting with the test and looking at the people, instead of trying
to guess the diagnosis. When you are working chiefly with relatively
normal individuals, . . . it is still more desirable to avoid the psychiat-
ric implication. . . . If you talk about the 87's and the 23's, then you
can set up relatively fresh associations with the significance of those
numbers.
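The coding convention described above (a 32-6 profile has scales 3 and 2 exceptionally high, 3 highest, and scale 6 exceptionally low) can be sketched as a small routine. The sketch is a simplification made for illustration: the function name is hypothetical, and the cutoffs of 70 for "exceptionally high" and 40 for "exceptionally low" are assumptions, since no numerical limits for coding are stated here.

```python
def code_profile(t_scores, high=70, low=40):
    """Build a short profile code from standard scores keyed by scale number.

    High points are listed first, highest first; low points follow a dash,
    lowest first. The `high` and `low` cutoffs are illustrative assumptions.
    """
    highs = sorted((s for s in t_scores if t_scores[s] >= high),
                   key=lambda s: -t_scores[s])
    lows = sorted((s for s in t_scores if t_scores[s] <= low),
                  key=lambda s: t_scores[s])
    code = "".join(str(s) for s in highs)
    if lows:
        code += "-" + "".join(str(s) for s in lows)
    return code


# Invented standard scores for scales 1 (Hs) through 9 (Ma).
profile = {1: 55, 2: 75, 3: 82, 4: 50, 5: 52, 6: 35, 7: 58, 8: 50, 9: 48}
print(code_profile(profile))
```

A code of this kind deliberately discards most of the profile; it serves to file a record with others of its type, after which the full profile should be consulted for detail.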


Psychological significance is given to coded patterns by cumulating experi-
ence with each type. The principal depository of this information is an Atlas
which gives descriptions and case histories for nearly 1000 psychiatric pa-
tients, classified by profile type. To take one example, a 50-year-old man
tested upon admission showed score 2 (D) over 70 with 3 (Hy) next high-
est, and with 9 and 6 (Ma and Pa) as low points. The staff diagnosis was
"psychosis, manic-depressive, depressed state." After a month of treatment
in a hospital, his profile had changed so that 8 and 9 (Sc and Ma) were
above 70 and 2 (D) was lowest; but high L and F made this record suspect.
At this time the staff changed the diagnosis to "paranoid," because the
symptoms had changed. The third test for this patient came two years later,
upon a readmission. At this time, he had returned to the 23 pattern, with no
very low scores. His diagnosis was again manic-depressive, depressed. The
case history from the Atlas follows (Hathaway and Meehl, 1951, p. 120):


The admission of this patient with severe depression of about two months'
duration was the latest of several such episodes, with seclusiveness, poor
memory, inability to work, and somatic complaints the outstanding
characteristics of the depression. When he was working, he misplaced tools;
and he was convinced that people watching him noticed the poor quality of his
work. He complained of failing memory; he slowed down physically and
mentally; he suffered from insomnia; there was loss of appetite; and in
general he lost contact with his surroundings. Lacking energy, he found it
very difficult even to dress himself or go to meals. His speech was retarded
and incoherent at times.

A year before admission there had been a similar attack from which he had
recovered after six electroshock treatments. Until this first attack, his
behavior had always been normal. His intelligence was average. A shy person,
not socially aggressive, withdrawn, and moderately religious, he had always
been kind and had never lost his temper. A premarital dependency on his
mother was later transferred to his wife. His general adjustment to society
was adequate although he was known as a "drifter," and at best he held only
semiskilled jobs. Throughout his life there had been a history of cyclic mood
swings in which he moved from periods of elation to periods of depression.
On admission he showed rather severe psychomotor retardation. He did not
appear to be delusional. He had no paranoid ideas, nor was he suicidal. His
sensorium and intellect were intact, and he had some insight into the
depression. He was [...]ative and expressed the hope that he [...] which
brought about rather marked [...] and talkative. He showed some [...] and
five days after the last of the eleven treatments he was discharged with
almost complete remission of his symptoms. At this time it was felt that he
would probably have another depression. Twenty [...] the patient returned to
the hospital. Following his first discharge, he [...] euphoric and unstable
for about three months, then had begun to slip into the depression which
persisted until the second admission. He was discharged after seven shock
treatments and was to return to the outpatient clinic for supportive care.
The prognosis about a further relapse was very guarded.


From this record we may note several points, the first being the essential
consistency of the two records taken upon admission, two years and eleven
shock treatments apart. The intermediate record, taken [...] disorganized,
showed a markedly different pattern, but [...] challenged by the control
scores. The high D scores in the admission profiles are consistent with the
clinical picture, and the high Hy [...] consistent with his "withdrawn, kind,
dependent" [...]. The [...] of his defenses, however, led the psychiatric
staff to classify him as psychotic rather than neurotic. This illustrates the
important point that even though MMPI scales use psychiatric language, they
are descriptions of personality rather than direct diagnoses.

There is no simple translation from MMPI information to [...] terms. The
user of the test must build up a repertoire of information from the Atlas,
from other studies scattered through the literature, and from clinical
experience. For our purposes, the meanings of the scales can be introduced
by Black's study with an adjective checklist. Each of 200 women students at
Stanford rated other girls residing in her dormitory by checking the
adjectives which best described her. The girl also described herself on the
checklist and took the MMPI. The statistical tabulation then showed which
adjectives were applied to girls with high scores on any MMPI scale. Table
62, based on a portion of the results, shows what reputation (i.e., typical
overt behavior) and what "published" self-concept goes with each MMPI
score. The tabulations are based on small groups and are therefore
indicative of general trends rather than of well-established associations.

Studies such as this extend MMPI interpretation to normal personalities.
The various MMPI scales do seem to depict different types of people. A
high score on 9 (Ma) may not indicate pathological lack of control, but it
does indicate a colorful, dynamic, self-assertive person. Many other scales
pick out recognizable types of overt behavior.

The difference between the self-ratings and ratings by others is striking.
The girls frequently use favorable adjectives to describe characteristics
which others describe in less flattering terms. The high 9's say, for
example, that they are enterprising and courageous, while others call them
boastful and selfish. The self-description in some cases, indeed, is
diametrically opposed to the reputation. The high 9's see themselves as
popular and peaceable, but acquaintances apply these adjectives to them
rarely. This strongly reinforces the view that the self-description is a
statement of what the person believes about himself or what he wants others
to believe, and not an adequate report of typical behavior. On the other
hand, the MMPI's indirect "sign" interpretation of the self-description may
come close to typical behavior by stating what a claim of popularity
"really" means.

Supplementary Keys. Numerous investigators have tried to develop
supplementary keys to identify subgroups of various types. Welsh and
Dahlstrom (1956) mention 100 supplementary keys, including scales for
socioeconomic status, dominance, prejudice, and social introversion. These
scales have not gained wide usage.

Among the many scales derived from the MMPI, one has attained special
prominence in clinical diagnosis and research, and it, oddly, was not intended


TABLE 62. Typical Behavior and Self-Descriptions Associated with MMPI
Scores of College Women


Scale or Pattern                Description by
with High Score          N      Dormitory Mates          Self-Description


Shy, moody, not energetic, 


2. Depression 16 Shy, not energetic, not re- 
not relaxed, not decisive 


laxed, not kind 


3. Hysteria 25 Many physical complaints, Trustful, friendly, not emo- 
flattering, not partial, not tional, not boastful 

13 clever i . . . 
or 31. Hypo- 9 Many physical complaints, Affectionate, partial, not or- 
chondriasis indecisive, high-strung, se- derly, not conventional 
with hysteria clusive, eccentric, apa- 

4 thetic : N . 

* Psychopathic 26  Incoherent, moody, partial, Dishonest, lively, clever, not 


not adaptable, not friendly, 
not practical 

Shiftless, not popular, unemo- 
tional, not having wide 
interests 

Self-distrusting, self-dissatis- 
fied, sensitive, shy, un- 


deviate sociable, frivolous, 
5. M ; self-controlled 

: Masculinity 15 Unrealistic, natural, not 
dreamy, not polished 


5 low, Feminin- 68 Worldly, not energetic, not 


ity rough, not shy D 
6 realistic 
7 Paranoia 24 Shrewd, hard-hearted Arrogant, shy, naïve, sociable 
* Psychas- 20 Dependent, kind, quiet, not Indecisive, soft-hearted, de- 
9 thenia self-centered pressed, irritable 

- Hypomania 52 Shows off, boastful, selfish, Enterprising, jealous, coura- 


geous, energetic, popular, 


energetic, not loyal, not 
abl peaceable, self-confident 


peaceable, not popular 


Source: After Black; see Welsh and Dahlstrom, 1956, pp. 151-172.


for practical use. Spence and Taylor wished to test the effect of anxiety on
learning, in an extension of Hull's theory of drive (Spence, 1958). They
presumed, from previous theory, that persons with marked, admitted anxiety
symptoms had higher levels of drive and thus would more quickly develop a
conditioned defense reaction. In order to identify extreme groups by a
simple, unsubtle measure of anxiety, Taylor requested experienced counselors
to choose MMPI statements which constituted overt admissions of anxiety. The
items so selected were combined in a short questionnaire and used for the
laboratory studies of learning. When a puff of air on the eye was associated
with a bright light, eyeblink responses to the light were far more numerous
among the anxious than among the "nonanxious" subjects (J. Taylor, 1951).
The questionnaire was subsequently adopted by many investigators and
clinical counselors, under the name of the Taylor Manifest Anxiety Scale. The
scale has not been standardized, validated, or published in the usual sense,
and it appears to have no special virtues to recommend it for clinical
purposes over other adjustment indicators.


VALIDITY OF INVENTORIES FOR SPECIFIC DECISIONS 
Screening of Deviant Personalities 


Separating Patients from Normals. Although the MMPI was designed with
psychiatric diagnosis as a criterion, the authors early abandoned claims that
the test had great power as a discriminant. Some of the scales had
reasonably satisfactory relations to the criterion, but others were regarded
as questionable even at the time of publication. In papers by the test
authors one finds many remarks like these: "The evidence for the validity of
Ma is certainly not conclusive." The published version of the Sc scale "was
only slightly better than the ones that were rejected." "The Pt scale has
never been considered very satisfactory." And for Pa, "Cross-validation was
always disappointing." This frankness is in welcome contrast to the glowing
accounts of other test developers who have made less effort to validate their
instruments.


Under favorable conditions, the various scales (except 5, 7, and 8, which
are especially weak) have more or less the same discriminating power. A
cutting score which yields 5 percent false positives among normals will
identify from 62 to 74 percent of the patients in the category to which the
scale corresponds. The precise character of the data is indicated in Figure
79, which shows the distribution on scale 2 (D) of 690 normals and 35 patients
who had previously been identified as clinically depressed. The data show a
distinctly higher mean for the patient group, 63 percent of them falling at or
above 78, which is the 95th percentile for normals. It is hard to say without
further analysis whether the screening validity is high enough to be useful.
In order to face this question, we must take into account the number of such
cases in the population likely to be tested (Meehl and Rosen, 1955).

As an illustration, let us assume that among the persons coming to a clinic
50 percent are depressed. Then let us change this figure to other base rates:
20 percent, 5 percent, and 2 percent. Figure 80 plots the probability that a
person with each score will be depressed, using the distributions of Figure
79 with each base rate in turn. Again, we see the clear relation between score
and probability of being properly called a depressed patient. The value of
the test in diagnosis depends heavily on the base rate. One will be right
[...] percent of the time when classifying a person with a score of 70 on
scale 2 as depressive, provided depressives constitute half of a clinic's
intake. If, as is more likely, the proportion of depressives is 20 percent,
one must shift the cutting score to 88 to have the same confidence in a
borderline diagnosis. Such a shift, however, leaves over half the depressive
group undetected. A poor D score is therefore far from dependable as an
indicator of maladjustment. A good score does permit confident judgments;
with any reasonable base rate, the tester can be sure that a person scoring
50 or below is very unlikely to be depressive. If these low scores are passed
over while the remaining cases are submitted to further interviewing or
testing, virtually no depressives will be overlooked.


FIG. 79. Distributions of normal and patient groups on MMPI scale 2 (D) (Starke R.
Hathaway and J. C. McKinley, 1942).
[Figure: smoothed distribution of normals (N = 690) above the distribution of
depressed patients (N = 35), both plotted against standard score, 20 to 100;
the upper 5 percent of the normal distribution is marked.]
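
The dependence on base rate can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the two figures quoted in the text for a cutting score of 78 (63 percent of depressed patients at or above it, 5 percent of normals at or above it) and asks what share of the persons at or above that cut are actually depressed. The function name and its parameters are hypothetical, and the result is not the same quantity as the curves of Figure 80, which give the probability at each exact score.

```python
def probability_depressed(base_rate, hit_rate=0.63, false_positive_rate=0.05):
    """Share of persons at or above the cutting score who are depressed.

    hit_rate and false_positive_rate are the figures quoted in the text
    for a cutting score of 78; base_rate is the assumed proportion of
    depressives among those tested.
    """
    true_positives = hit_rate * base_rate
    false_positives = false_positive_rate * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


# The four base rates considered in the text.
for base_rate in (0.50, 0.20, 0.05, 0.02):
    p = probability_depressed(base_rate)
    print(f"base rate {base_rate:4.0%}: P(depressed | score >= cut) = {p:.2f}")
```

With the same cutting score, the proportion of correct "depressed" calls falls from about nine in ten at a 50 percent base rate to about one in five at a 2 percent base rate.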


FIG. 80. Probability of correct identification of depressives.
[Figure: probability of correct identification, 0 to 1.00, plotted against T
score on MMPI scale 2 (D), 40 to 80, with one curve for each base rate: 50%,
20%, 5%, and 2%.]

Military Screening. Essentially similar results have been found for other
modern questionnaires, though the validity of [...] screening military
populations (W. A. Hu[...]). An inventory of only twenty items was profitably
used by the Navy to determine which men should be seen by psychiatrists for
[...] discharge as unfit. With a cutting score set to allow 5 percent false
positives, 93 percent of the discharges could be identified. A questionnaire
[...] 2081 Seabees successfully identified 281 cases later adjudged by
psychiatrists to present neuropsychiatric conditions, missed 16 who came to
psychiatrists' attention through difficulty on duty, and [...] that the tests
permitted psychiatrists to omit individual interviews [...] men, not a
trifling saving.

Screening of Students. Attempts to screen students to find those needing
counseling have been more disappointing. Over 800 college [...] were
interviewed repeatedly during the year by counselors who then made a
diagnosis of the kind and extent of maladjustment (Darley, 1937). The Bell
Adjustment Inventory given at the start of the year identified 40 [...] as
having problems relating to home adjustment, but missed 41, and [...] 78
false positives. On emotional adjustment, there were 32 hits, [...] misses,
and 42 false positives. In a study of the Bernreuter, an exceptionally good
criterion was used: observation records gathered continually [...] year. Of
sixteen girls at the maladjusted extreme on the Neurotic Tendency scale (out
of 81 subjects), only six were considered actually maladjusted, whereas two
of those least maladjusted according to the test were rated maladjusted on
the criterion. The Self-Confidence scale was more successful. Ratings agreed
with test scores for all ten girls showing [...] low confidence on the test,
and for six of eight whose test scores showed high confidence (Feder and
Baer, 1941).
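
The home-adjustment counts quoted above (40 hits, 41 misses, 78 false positives) can be turned into two proportions that show why the result is called disappointing. This is a minimal sketch; the function and key names are hypothetical, and because the exact total tested is given only as "over 800," no rate involving the correctly passed students is computed.

```python
def screening_summary(hits, misses, false_positives):
    """Summarize a screening result from its raw counts."""
    actual_cases = hits + misses      # students the counselors judged maladjusted
    flagged = hits + false_positives  # students the inventory singled out
    return {
        "proportion_of_cases_detected": hits / actual_cases,
        "proportion_of_flagged_who_are_cases": hits / flagged,
    }


# Home-adjustment figures reported in the text.
home = screening_summary(hits=40, misses=41, false_positives=78)
for name, value in home.items():
    print(f"{name}: {value:.2f}")
```

On these figures the inventory found about half of the students with home-adjustment problems, and only about one in three of the students it flagged was confirmed by the counselors.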

Although errors are too frequent to warrant trust in [...] indicators of
maladjustment, scores have validity better than chance. [...] and Thomas
(1938) found that the mean score among students [...] voluntarily for
counseling was significantly deviant on the Flanagan Keys. The correlation
of Self-Confidence with rated maladjustment was [...] for men. Overlap in
scores between normal and counseling groups was too [...] for screening
validity. Correlations with ratings were of negligible size in a group of
probationary students referred for assistance; any inventory is more likely
to be valid in groups seeking assistance than in groups who are
uncooperative.

Identifying problem cases by self-report methods at earlier ages has proved
to be very difficult. Investigators who compare known delinquents with
normals find some differences in scores, but the scores of the groups
overlap so much as to discourage reliance on the tests for screening. On the
Heston Personal Adjustment Inventory, 30 to 48 percent of delinquents
(compared to 5 to 16 percent of a matched control group) fell below the
[...] percentile on Emotional Stability, Confidence, and two other scales
(Hathaway and Monachesi, 1953). This relation is far poorer than the level
of discrimination achieved by psychiatric interviewing.

An exceptional study tested all ninth-graders in Minneapolis and followed
the 4000 cases for two years. Predictive validity was examined by comparing
those who later became delinquents with the remainder. Several significant
differences were found, the main results being indicated in Table 63. Scales
[...] and 4 were prognostic of delinquency, and codes 2, 5, and 7 were
relatively rare among delinquents. The F scale proved to be the most
indicative of potential delinquency. Scale 4 indicates an acting-out,
impulsive personality, insensitive to social controls.


TABLE 63. Rate of Juvenile Delinquency for Various MMPI Profile Types


                 Percent of     Percent of Boys    Percent of     Percent of Girls
                 All Boys       in Code Class      All Girls      in Code Class
                 Falling in     Who Became         Falling in     Who Became
High Score       Code Class     Delinquent         Code Class     Delinquent

F                     5              48                 3              22
4 (Pd)               21              28                19              12
9 (Ma)               21              22                17               8
2 (D)                 4              12                 1               3
5 (Mf)                5               9                17               5
7 (Pt)                6              19                 7               4
Total, all
  classes           100              22               100               7


Source: Hathaway and Monachesi, 1953, p. 181.
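
One way to read Table 63 for screening purposes is to ask what share of all eventual delinquents a given code class would capture. The sketch below does this arithmetic for the boys' columns of four rows whose labels are legible; the dictionary layout and variable names are hypothetical, and the percentages are taken from the table as printed.

```python
# (percent of all boys in code class, percent of that class who became delinquent)
boys = {
    "4 (Pd)": (21, 28),
    "9 (Ma)": (21, 22),
    "2 (D)": (4, 12),
    "7 (Pt)": (6, 19),
}
overall_delinquency_rate = 22  # percent of all boys, from the "Total" row

share_of_delinquents = {}
for code, (pct_in_class, pct_delinquent) in boys.items():
    # Delinquents from this class, expressed as a percent of all boys tested.
    pct_of_all_boys = pct_in_class * pct_delinquent / 100
    share_of_delinquents[code] = pct_of_all_boys / overall_delinquency_rate
    print(f"{code}: {pct_of_all_boys:.1f}% of all boys, "
          f"{share_of_delinquents[code]:.0%} of eventual delinquents")
```

On these figures, singling out every boy with a high score on scale 4 would reach roughly a quarter of the eventual delinquents, while nearly three-quarters of the boys singled out would not become delinquent.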


13. What does Table 63 indicate about the practical value of MMPI for screening
potential delinquents?
14. Is concurrent or predictive validity the primary concern in screening studies?
15. What importance can be attached to the finding that Pd scores decrease
markedly with age?


Design of Validation Studies. The studies above are based on a comparison
of one score at a time. Writers on the MMPI have stressed judgment [...] on
all scores together. It is well established that a deviant [...] has high
averages on several scales, not just the one "appropriate" to [...], and
this is a reasonable finding since a disorder often involves [...] of
symptoms. The argument is made, therefore, that screening should take into
account many scores at once by means of a linear combination, a nonlinear
combination, or a multiple cutoff pattern. (See Chapter [...] for the
distinctions between these methods.) While one can find [...] papers arguing
for such screening methods, it is difficult to locate [...] about the
accuracy of multiscale analysis or evidence comparing [...] scale screening.
Statistical studies of this question are few and inadequately reported. One
reason, of course, is the shift in emphasis from [...] interpretation to
descriptive interpretation shortly after the MMPI [...] were put into their
final form; another, the fact that too few cases of [...] pattern are found
for adequate statistical summary. Some [...] use of MMPI patterns to
distinguish between types of patients is presented below.

Before trying to explain the generally poor performance of [...] screening
devices, let us emphasize certain aspects of the design of validation
studies. These points are important because some articles report striking
success in predicting various criteria, and in many instances the apparent
success merely results from improper analysis. The first error to be noted is
validation of a key or scoring formula on the same cases used to select items
and establish weights. As was pointed out in Chapter 12, cross-validation is
essential to avoid giving credit for chance discriminations peculiar to the
sample studied.
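
The danger of validating a key on the cases used to build it can be shown with purely random data. The following simulation is a sketch under stated assumptions, not a reanalysis of any study: 200 items and a two-category criterion are generated by coin flips, so no item has any true validity; the 20 items that happen to correlate best with the criterion in a derivation sample of 50 are formed into a key; and the key is then scored on a fresh sample. All sizes, names, and the seed are arbitrary choices for the illustration.

```python
import random


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def random_sample(rng, n_cases, n_items):
    """Item responses and criterion are independent coin flips."""
    items = [[rng.randint(0, 1) for _ in range(n_items)] for _ in range(n_cases)]
    criterion = [rng.randint(0, 1) for _ in range(n_cases)]
    return items, criterion


def key_scores(items, key):
    """Score each case as the number of keyed items endorsed."""
    return [sum(row[i] for i in key) for row in items]


rng = random.Random(0)
N_ITEMS, KEY_LENGTH = 200, 20

# Select the key on the derivation sample.
derivation_items, derivation_criterion = random_sample(rng, 50, N_ITEMS)
item_validity = [
    pearson([row[i] for row in derivation_items], derivation_criterion)
    for i in range(N_ITEMS)
]
key = sorted(range(N_ITEMS), key=lambda i: item_validity[i], reverse=True)[:KEY_LENGTH]

# "Validity" on the same cases versus validity on new cases.
r_derivation = pearson(key_scores(derivation_items, key), derivation_criterion)
fresh_items, fresh_criterion = random_sample(rng, 500, N_ITEMS)
r_cross = pearson(key_scores(fresh_items, key), fresh_criterion)

print(f"correlation in derivation sample: {r_derivation:.2f}")
print(f"correlation in fresh sample:      {r_cross:.2f}")
```

The key looks impressive on the derivation sample because it was assembled from that sample's chance discriminations; on the fresh cases the correlation should fall to about zero.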


A second common fault is to demonstrate significant differences (e.g.,
between delinquents and normals) without examining the base rate. The
usefulness of a screening or categorizing instrument depends upon the number
(not the percentage) of misses and false positives at any cutting score. With
enough cases, highly significant differences can be established for
instruments which have no practical value.
A similar remark is to be made about comparisons of extreme groups.
Gough (1957) shows a very significant difference on the Sa scale of the
California Psychological Inventory between boys nominated by principals as
most and least self-accepting. This comparison is based on 52 [...] each
group, selected from among six high schools. Gough computes a biserial
correlation of .46 for these data. But as can be seen in Figure 20, [...]
a great difference between extreme groups even when the correlation computed
on the entire population is very low. A correlation coefficient must be
computed on (or estimated for) the entire population to whom the test will be
applied. If, as Gough's report seems to indicate, the principals nominated
the extreme 1 percent of their student bodies, a recomputation indicates that
the true validity of Sa is approximately .15, not .46.
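
The inflation produced by comparing extreme groups can be reproduced in a small simulation. The sketch below assumes a population in which the test score and the rated trait are bivariate normal with a true correlation of .15, then correlates test score with group membership using only the highest and lowest 1 percent on the rating. The population size, seed, and variable names are arbitrary, and the coefficient computed for the extreme groups is a point-biserial, not the biserial coefficient Gough reported, so the figures are illustrative only.

```python
import random


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
TRUE_R, POPULATION = 0.15, 20000

# Each person has a rating (the criterion) and a test score correlated TRUE_R with it.
people = []
for _ in range(POPULATION):
    rating = rng.gauss(0, 1)
    score = TRUE_R * rating + (1 - TRUE_R ** 2) ** 0.5 * rng.gauss(0, 1)
    people.append((rating, score))

r_full = pearson([r for r, _ in people], [s for _, s in people])

# Keep only the lowest and highest 1 percent on the rating, as nominated groups.
people.sort(key=lambda person: person[0])
k = POPULATION // 100
extremes = people[:k] + people[-k:]
membership = [0] * k + [1] * k
r_extreme = pearson(membership, [s for _, s in extremes])

print(f"correlation in whole population: {r_full:.2f}")
print(f"correlation from extreme groups: {r_extreme:.2f}")
```

The extreme-group coefficient should come out well above the population value, which is the point of the caution: a difference between nominated extremes says little about validity for the whole group to be tested.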
Explanation of Results. The discouraging results for even the best available
inventories can be explained in two rather different ways. The defender of
the inventory will argue that the evidence is, on the whole, favorable; the
critic will argue that the inventory is inefficient either in principle or
because of poor design. The defender can argue that the criteria used in
validation and in scale construction are themselves invalid, and, indeed,
that a test which predicts diagnosis perfectly would be far from a true
picture of personality. The diagnosis of maladjustment is controversial at
best. Psychiatrists disagree as to what categories should be used and
disagree in their classification of individuals. Clinical staffs have such
marked biases toward the use of certain diagnoses that it has been said, only
half jokingly, that whether a patient is called psychotic or neurotic depends
as much on the hospital he enters as on his symptoms.
Meehl insists that the diagnosis is at best only a starting point. Just as Binet
started with bright pupils selected by teachers, the Minnesota testers started
with diagnosed hypochondriacs; but the intention in both cases was to
develop an instrument which would be superior to the starting criterion, i.e.,
which would in the end disagree with it. The Binet scale, they point out, has
value precisely because it detects bright children that teachers overlook,
and corrects the teacher's overfavorable evaluation of other cases. While the
attack upon diagnoses is legitimate, one must be wary of any implication
that when test and psychiatrist disagree the test is the more dependable.
Evidence to support this type of claim has not been developed for the MMPI
as it has for the Binet scales.

A second pertinent defense is that in many studies the criterion is crudely
determined, even if in principle it could be made dependable. Thus the data
on depressives presented above are probably unfair to the MMPI. The patient
group included some cases who might have recovered from their depressive
phase before testing, and the normal group, including as it did
unhospitalized relatives of patients, may well have included numerous
undetected depressives. More generally, nonpatient status is no guarantee of
sound personality; many persons in the community have serious maladjustment
which remains undetected only so long as they are exposed to no exceptional
stress. The fact that the pressures of life are not the same for all persons
greatly reduces the prospect of predicting behavior from personality
measures.
It remains a question whether some other investigator could produce a better
screening instrument than the MMPI. There are many reasons for thinking
that it is far from the most efficient actuarial instrument that could be
developed. In the original derivation of scales, the number of cases of each
patient group was generally below fifty, and often below thirty; as a
consequence, chance may have played a large part in assigning items to scales.
Moreover, the patients and normals were quite differently motivated in taking
the test, and neither had the motivation likely to be encountered when the
test is used for screening. The scales have lower stability than desirable,
the median for normals over one week being .80 (Cottle, 1950a). Items were
combined with little regard for their intercorrelations, yet in prediction
it is profitable to maximize item heterogeneity so as to raise the
correlation of the scale with the criterion. Separate scoring of subtle and
obvious items would probably be of value; despite the empirical derivation
of MMPI keys, there is evidence (McCall, 1958) that the obvious items carry
almost all of the discriminating power. Finally, although it is evident that
combining several scales can improve differentiation, no formula for
combining the scales to separate normals from patients has been
systematically validated. Even with improved test construction, the
variation from one population to another (e.g., between a community clinic in
a small town and a city hospital) may be so great that even the most powerful
actuarial test will have very limited general validity.

We may summarize the findings on personality tests as screening instruments
as follows: In dealing with large populations (military recruits, college
students, etc.) where individual attention cannot be given to everyone,
questionnaires validated on that type of population are of great value as a
preliminary screen. Persons with better scores can be passed over while
more systematic diagnostic procedures are applied to the remainder. The
number of deviant cases missed is reduced if suitable methods to control
faking are applied. It is never proper to assume that those earning poor
scores on a questionnaire are seriously maladjusted; the number of false
positives makes it imperative to regard the test as only a first stage of
investigation.


Differential Diagnosis of Patients 


[...] decision as to the probable [...]

MMPI profiles for various diagnostic groups differ significantly, and
experts can classify profiles with some success. Guthrie (1950) asked them
to classify [...] analysis of MMPI scores, whether impressionistic or
actuarial, is at best a source of hypotheses about diagnosis to be checked by
other methods. In this role, it can be of definite assistance in the clinic.

Results on differential diagnosis with questionnaires other than MMPI
have in general been unencouraging, and in recent years the MMPI has
displaced all competing questionnaires for this purpose.


16. SVIB keys for differentiating medical specialist groups from each other had
little correlation with the key for separating physicians from men in general.
What does this imply regarding the design of keys for distinguishing one type
of patient from another?


Prediction of Vocational Criteria 


Inventories have had rather little success in predicting employee
performance. Ghiselli and Barthol (1953) found 118 correlations between job
proficiency and presumably relevant inventory scores. Nearly all correlations
were positive. The average correlation was as high as .36 for sales personnel
but only .14 to .18 for supervisors and foremen. There was a wide range among
coefficients for the same occupation. The Ghiselli-Barthol averages are
probably unrealistically high. Since investigators file and forget hundreds
of studies with small samples which showed unpromising relations, only a
biased selection reaches publication.

The experience of Household Finance Corporation is consistent with these
statistics. Wonderlic and Hovland (Moore, 1941, p. 60) report: "Our early
investigations were carried on with published tests which have been
standardized by others. . . . We were unable to find any test in which the
total score was significantly prognostic of success in our organization to
warrant its inclusion as part of a selection program. In the cases of many of
the purchased personality tests, results were obtained which ran counter to
expectations. Clerical workers seemed to be more aggressive than salesmen,
salesmen were higher than managers."
Inventories must inquire about typical behavior rather than behavior under
specific conditions. Regardless of what a person is prone to do when given
free choice, he adapts himself to the demands of different situations. He
may be assertive as a parent, [...] in a [...], boisterous at a party,
decorous in church. People [...] some roles, but there is no evidence that
[...] assume only the roles which match his typical behavior. [...] posture;
the young man who slouches habitually can be placed in uniform and trained
to hold as rigid a military bearing as anyone else. Personality, as commonly
measured, probably has much to do with the sort of job and personal [...] a
person seeks, but has little to do with his ability to perform a role when
thrust into it. The adjusted person is able to adapt his style to role
demands.

Dependable use of personality inventories for guidance requires [...]
study of men who succeed and remain in particular occupations. The [...]
adequate work of this kind has been done with the Kuder Preference
Record—Personal. Figure 81 shows some of Kuder's evidence that occupations
[...]


FIG. 81. [Profile chart from the Kuder Preference Record—Personal:
occupational groups (clergymen, insurance salesmen, lawyers, farmers,
physicians, accountants) placed by score level on scales A (being active in
groups), B (being in familiar, stable situations), C (working with ideas),
D (avoiding conflict), and E. The caption is only partly legible.]


[...] the variation from situation to situation, [...] failure only in a
definite job in a specific firm. [...] improvement of average sales per man
in the [...] men who quit in their first year, it was [...] index produced
206 percent as much business [...] average agent, while those rated E
produced only 41 percent of the average (Kurtz, 1941). In general, for
prediction of success biographical inventories seem to be more satisfactory
than questions about personality.


17. Should inventories be used to advise students about their probable vocational
success?
18. What advantages does a biographical inventory have over a personality
questionnaire in employee selection?
19. What differences among clerical jobs might account for variation in the validity
of personality tests?
20. How might [...] of self-confidence help one student to attain high marks, yet
be a drawback to another?


VALIDITY OF INVENTORIES FOR TRAIT DESCRIPTION 


The Test as a Mirror for the Counselee 


In counseling, the personality inventory is used like the interest inventory
to help the individual examine his own characteristics as in a mirror. He
knows what he has said, but the test permits him to compare himself with
others. His percentile standing in various traits is an appropriate initial
topic in counseling. For this purpose, it is probably not wise to use subtle
scales or scales whose meaning is difficult to communicate, since the
instrument serves primarily to reflect the counselee's own professed
attitudes. To show a counselee his MMPI profile could lead only to
difficulties in explaining the meaning of the categories, and possibly to his
rejection of interpretations based on subtle items.
What inventory is preferred will depend upon the nature of the counseling.
General adjustment inventories or other single-score instruments are of
little use in counseling since they pose few questions for discussion. A
descriptive scale reporting introversion, impulsiveness, and so on is of
potential value in vocational guidance and may open discussion of traits which
the client regards as [...]. Descriptions in terms of preferred activities
(e.g., Kuder Preference Record—Personal) and values (Allport-Vernon) are
somewhat better suited to vocational guidance than scales which describe
emotional reactions. The Mooney Problem Checklist is of considerable value
because it draws attention to specific concerns the client is ready to talk
about and wants help with. It is, in effect, a preliminary interview rather
than a measuring device.
A descriptive inventory useful in initiating counseling of college students
is that of Edwards. The profile describes fifteen “needs” which presumably
affect the subject’s actions. Some of the needs, and items related to them,
are:

Abasement—to accept blame when things do not go right
Achievement—to be a recognized authority
Affiliation—to be loyal to friends
Aggression—to attack contrary points of view
Autonomy—to be independent of others in making decisions

The items are paired and the subject chooses the one he prefers in each pair.
The interpretation of the scales clings to the explicit content of the items,
which aids communication, and the summary in terms of needs may add to the
subject's insight into himself. The counselor can help him examine how his
major needs are ..., how well his future plans will satisfy these needs, or
how factors in his earlier development caused certain needs to develop.
Where a test is used as a reflection of the client's remarks, several
questions related to validity are of interest. Only partial answers can be
given here, but further information on particular tests being considered as
counseling aids should be obtained from the test manuals.

• Are the scores adequate measures of the published self-concept? Would
another set of items give the same profile? This is to be answered by
parallel-form or internal-consistency reliabilities, or by correlations
between inventories having similar scales. The better inventories show
reliabilities of .80 and above, which is sufficient to pick out salient
characteristics. Scores with the same name in different inventories may have
low correlations, which emphasizes the need for cautious interpretation.
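
One common internal-consistency index is coefficient alpha. The sketch below
is illustrative only: the function name and the item responses are
hypothetical, not taken from any inventory discussed here, and it assumes
that every subject answered every item on the scale.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of subjects, each a list of item scores."""
    k = len(item_scores[0])  # number of items on the scale
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across subjects.
    item_vars = [pvariance([subject[i] for subject in item_scores])
                 for i in range(k)]
    # Variance of the total scores across subjects.
    total_var = pvariance([sum(subject) for subject in item_scores])
    if total_var == 0:
        raise ValueError("total scores do not vary")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five subjects to four items keyed for one scale.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
alpha = coefficient_alpha(responses)
```

On these made-up responses alpha comes out at about .79, just under the .80
level mentioned above. A real scale would of course rest on far more subjects
and items.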
• Do scores reflect lasting characteristics? E. L. Kelly administered several
questionnaires to 300 engaged couples during the years ... and retested
nearly all the subjects again in 1954. Among the instruments used were the
SVIB, the Bernreuter, the Allport-Vernon, and the Remmers generalized
attitude scales. The stability coefficients in Figure 82 show a striking
degree of similarity between self-descriptions given twenty years apart. The
interest scores are most stable, but when we allow for the initial
unreliability of the Allport scale it appears that values are equally stable.
Personality scores are only slightly poorer. Attitudes, on the other hand,
are temporary. While the self-concept seems to remain relatively stable, the
meanings attached to the rest of the world change greatly with experience.

We may also present evidence on stability arising from a study of children’s
personalities. These data are not from questionnaires or ratings. Trained
interviewers asked mothers to tell the extent to which the children showed
such problems as insufficient appetite, nailbiting, and quarrelsomeness.


[FIG. 82. Retest reliability after twenty years. Horizontal axis: correlation,
.30 to 1.00. Measures plotted: Interests (SVIB): Architect, Office Manager,
Minister; Values (Allport-Vernon): Economic, Political; Personality
(Bernreuter): Self-Confidence, Sociability; Personality (Self-Rating):
Breadth of Interest, Dependability; Attitude (Remmers): Marriage, Church.
Caption fragment: "The dot indicates the reported ..."]
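
Allowing for the unreliability of a scale, as is done for the Allport scale
in the discussion of Figure 82, is conventionally handled with the correction
for attenuation: the observed correlation is divided by the square root of
the product of the two reliabilities. The following is a minimal sketch under
that assumption. The function name and every number in it are hypothetical;
none of them is read from Figure 82.

```python
from math import sqrt


def correct_for_attenuation(r_observed, reliability_first, reliability_second):
    """Estimate the correlation that error-free scores would show."""
    if not (0 < reliability_first <= 1 and 0 < reliability_second <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / sqrt(reliability_first * reliability_second)


# Hypothetical interest scale: high retest correlation, high reliability.
interest_corrected = correct_for_attenuation(0.72, 0.90, 0.90)

# Hypothetical value scale: lower retest correlation, but also measured
# less reliably at each testing.
value_corrected = correct_for_attenuation(0.48, 0.60, 0.60)
```

With these invented figures both corrected coefficients equal .80, which is
the kind of result meant by saying that two sets of scores are equally stable
once unreliability is allowed for.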
... mental tests in Table 21 (p. 176), which ... the same longitudinal study.
In both tables, we find that ... coefficients ... are much higher ...
Personality scores are lower than those ... mental tests, but the drop is
surprisingly small. While we do not have information on the stability of ...


TABLE 64. Correlation of Problem Behavior Score of Boys with
Score at a Later Age

Approximate Age          Years Elapsed Between First and Second Score



27 —.01 

40 j a 

13 .38 731 = 47 

3 550 a E - 

4 56 2 a - 

$ X 155 5 € 
4 3 75 
9 70 p.i 

n 86 


SOURCE: Macfarlane et al., 1...

... for children, these data show beyond a doubt that problem behavior itself
has a great deal of stability over at least a ...-year span.

• Do the descriptions agree with external evidence of typical behavior? ...
comparing self-descriptions with objective records of behavior is ...
comparisons of scores with judgments. The study reported in the accompanying
table shows that children's self-reports have a quite small degree of
correspondence with the way their ... rate them (see also Powell, 1948). With
college students, ... validities have been reported. Gordon (1953), for
example, found correlations of .47 to .73 between scores on his inventory and
ratings ... by dormitory mates. These are concurrent, not predictive,
validities;


TABLE 65. Correlations Between Three Types of Adjustment Measures for
Ninth-Graders

                                    Teacher Judgments  Peer     Self-Reports
                                              Forced   Judg-    Self-    Social
                                    Rating    Choice   ments    Adjust-  Adjust-  SRA
                                                                ment     ment
Teacher judgments:
  Rating of adjustment                         .77      .56      .30      .33     .22
  Score on forced-choice
    descriptive questionnaire        .77                .56      .28      .29     .15
Peer judgments:
  [...] desirable traits             .56       .56               .28      .28     .16
Self-reports:
  California Test of Personality,
    Self-Adjustment                  .30       .28      .28               .73     .61
  California Test of Personality,
    Social Adjustment                .33       .29      .28      .73              .47
  SRA Youth Inventory,
    Basic Difficulty score           .22       .15      .16      .61      .47


kin 
correlations for the EPPS were obtained in an unpublished study by um 
and Klett. (Cf. also Table 62.) 
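
The pattern in Table 65 can be summarized by averaging the correlations
within and between the three sources of information. The sketch below does
this with the coefficients as they can be read from the table; where a cell
is illegible in one row, the value is taken from the matching cell in the
symmetric position. The measure labels and the helper name `mean_r` are
shorthand introduced here for illustration.

```python
from statistics import mean

# Source of each measure in Table 65.
source_of = {
    "teacher_rating": "teacher",
    "teacher_forced_choice": "teacher",
    "peer_judgment": "peer",
    "ctp_self_adjustment": "self",
    "ctp_social_adjustment": "self",
    "sra_basic_difficulty": "self",
}

# One entry per pair of measures (the table is symmetric).
r = {
    ("teacher_rating", "teacher_forced_choice"): 0.77,
    ("teacher_rating", "peer_judgment"): 0.56,
    ("teacher_rating", "ctp_self_adjustment"): 0.30,
    ("teacher_rating", "ctp_social_adjustment"): 0.33,
    ("teacher_rating", "sra_basic_difficulty"): 0.22,
    ("teacher_forced_choice", "peer_judgment"): 0.56,
    ("teacher_forced_choice", "ctp_self_adjustment"): 0.28,
    ("teacher_forced_choice", "ctp_social_adjustment"): 0.29,
    ("teacher_forced_choice", "sra_basic_difficulty"): 0.15,
    ("peer_judgment", "ctp_self_adjustment"): 0.28,
    ("peer_judgment", "ctp_social_adjustment"): 0.28,
    ("peer_judgment", "sra_basic_difficulty"): 0.16,
    ("ctp_self_adjustment", "ctp_social_adjustment"): 0.73,
    ("ctp_self_adjustment", "sra_basic_difficulty"): 0.61,
    ("ctp_social_adjustment", "sra_basic_difficulty"): 0.47,
}


def mean_r(source_a, source_b):
    """Average correlation between measures from the two named sources."""
    values = [
        value
        for (a, b), value in r.items()
        if {source_of[a], source_of[b]} == {source_a, source_b}
    ]
    return mean(values)


self_with_self = mean_r("self", "self")        # agreement among self-reports
teacher_with_self = mean_r("teacher", "self")  # teachers versus self-reports
peer_with_self = mean_r("peer", "self")        # peers versus self-reports
teacher_with_peer = mean_r("teacher", "peer")  # teachers versus peers
```

Read this way, the self-report measures agree with one another (average
about .60) much better than they agree with teacher judgments (about .26) or
peer judgments (.24), while teachers and peers agree with each other at .56.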


• Do the descriptions agree with the true self-concept? A criterion could be
obtained by asking a therapist who is well acquainted with the person's inner
attitudes to describe him. Evidence of this sort, however, is ... Rogers
cites correlations ranging from .38 to .48 between scores on his adjustment
inventory and ratings by clinicians, but this is an isolated investigation
not adequately reported (1931).


21. Applying the method of p. 138 to Figure 82, about what proportion of the
variance in self-confidence is due to random error of measurement, what
proportion to genuine but unstable characteristics, and what proportion to
stable characteristics?
22. In counseling with the Edwards inventory, should raw scores (ranging ...
on each scale) or percentiles be used to plot the profile?


Descriptions to Aid Institutional Decisions 


The second major descriptive use of inventories is to provide others with
insight regarding the individual. This may be important in clinical diagnosis
or other institutional decisions, and in prescriptive counseling. The
therapist may wish to know as much as he can about the person's conflicts long
before they emerge in interviews. The college counselor (see p. 456) may
wish to know what hidden attitudes are preventing a student from doing his
best work. For these purposes, subtle tests may have distinct advantages in
the hands of highly experienced testers.

The MMPI is the generally preferred instrument for clinical patients and 
for many counseling uses. The interpreter must bring to bear information 
from the Atlas and other sources in order to translate scores into psychologi- 
cal constructs. An illustrative description is that given by Grayson (Shneid- 
man, 1951, pp. 268-269) for a 25-year-old veteran (see Figure 78): 


This profile may, with a good deal of confidence, be considered as a valid
representation (F within limits) of a seriously disturbed patient (unusually
elevated ...) who has a tremendous amount of anxiety and depression (D), with
sad, worrisome feelings of inadequacy (high D, Pt). The patient is
self-depreciative (low L, low K) and lacking in self-assurance to the point of
being a compulsive “doubter” (high ?). He has attempted to resolve his anxiety
through hysterical displacement (high Hs) and obsessive-compulsive mechanisms
(high Pt) but these have been unsuccessful, with the result ... weak ego
structure (high Sc). In addition, the patient ... of hostility ... self
(high Pd) and others (high Pa). The ... feelings of anxiety and depression (D)
combined with ... aggression (Pd, Pa), in an individual who possesses
insufficient ego control (...) to inhibit his tendency to act out impulses
(Ma) add up to an explosive ... which presents strong possibilities of
suicidal and homicidal behavior. Diagnostically, the patient may be
classified as incipient paranoid schizophrenia.


The validity of the description is attested by the full case ... and ...
protocol. The following statements are made ... by the ... psychotherapist:
“He seemed suspicious, indecisive and unable to re[...]. . . . There seems to
be considerable guilt in relation to his own hostility. He has established
some defenses against this through obsessions but these defenses are cracking
and he fears that his hostile impulses might become so ... that he would be
unable to control them. . . . The patient seemed obsessed with thoughts about
death, homicide, and suicide.”


The MMPI is not entirely suitable for normal groups, particularly younger
ones; some of the items (“... life is satisfactory”) arouse criticism from
... and ... of clinical origin produce information principally on undesirable
traits. The California Psychological Inventory (CPI) and Minnesota Counseling
Inventory (MCI) are descendants of the MMPI specifically designed for
relatively normal high-school and college students. The MCI keys are labeled
Family Relationships, Emotional Stability, Conformity, Adjustment to Reality,
Mood, and Leadership. Some of these keys correspond rather closely to the
MMPI clinical scales in general purpose (e.g., Conformity is a substitute for
Pd), but the MCI labels carry less damaging connotations.

The validity of descriptive interpretations is difficult to assess, especially 
when the construct employed cannot be equated with any one observable 
behavior. MMPI scales have been given meaning by integrating evidence 
from all manner of studies, gradually formulating a psychological hypothesis 
about the meaning of each score. Meehl's remarks on the Pd scale illustrate 
the process (Cronbach and Meehl, 1955):* 


The Pd scale of MMPI was originally designed and cross-validated
upon hospitalized patients diagnosed “Psychopathic personality, asocial
and amoral type.” Further research shows the scale to have a limited
degree of predictive and concurrent validity for “delinquency” more
broadly defined. Several studies show associations between Pd and
very special “criterion” groups which it would be ludicrous to identify
as “the criterion” in the traditional sense. If one lists these
heterogeneous groups and tries to characterize them intensionally, he faces
enormous conceptual difficulties. For example, a recent survey of hunting
accidents in Minnesota showed that hunters who had “carelessly” shot
someone were significantly elevated on Pd when compared with other
hunters. . . . The finding seems to lend some slight support to the
construct validity of the Pd scale. But of course it would be nonsense to
define the Pd component “operationally” in terms of, say, accident
proneness. We might try to subsume the original phenotype and the
hunting-accident proneness under some broader category, such as “Disposition
to violate society's rules, whether legal, moral, or just sensible.” But
now we . . . are using a rather vague and wide-range class. . . .
. . . class specification to cover a group trend that (nondelinquent)
high school students judged by their peer group as least “responsible”
score over a full sigma higher on Pd than those judged most
“responsible” . . . Again, any clinician familiar with MMPI lore would
predict an elevated Pd on a sample of (nondelinquent) professional
actors. Chyatte's confirmation of this prediction tends to support both
(a) the theory sketch of “what the Pd factor is, psychologically”; and
(b) the claim of the Pd scale to construct validity for this hypothetical
factor. Let the reader try his hand at writing a brief phenotypic criterion
specification that will cover both trigger-happy hunters and Broadway
actors! And if he should be ingenious enough to achieve this, does his
definition also encompass Hovey's report that high Pd predicts the judgments
“not shy” and “unafraid of mental patients” made upon nurses by
their supervisors? And then we have Gough’s report that low Pd is
associated with ratings as “good-natured,” and Roessell’s data showing
that high Pd is predictive of “dropping out of high school.” The point is
that all seven of these “criterion” dispositions would be readily guessed
by any clinician having even superficial familiarity with MMPI
interpretation; but to mediate these inferences explicitly requires quite a
few hypotheses about dynamics, constituting an admittedly sketchy (but far
from vacuous) network defining the genotype psychopathic deviate.

* References for the studies described are given in the original.


This body of evidence leaves little doubt that Pd has some relation to
internal personality structure. The correlations cited do not represent strong
relations; if they did, we would find the same person dropping out of school,
rated ill-natured, becoming a Broadway actor, and shooting a fellow hunter.
Circumstances dictate much of behavior. Personality structure, even if
perfectly measured, represents only a predisposition rather than an absolutely
determining force.

Granting that Pd and other scores have some validity, we are still uncertain
as to the closeness of correspondence between the scores and the true,
hidden personality structure. Before interpretations can be used with
confidence, we require evidence as to how often we go wrong in assuming that
a person with high Pd has this vaguely defined pattern of arrogant, unruly,
irresponsible attitudes.
The facts required to assess the adequacy of descriptions are seriously
incomplete, and many of the findings strike a pessimistic note. Gough (1957)
correlated a number of CPI scores with ratings of students made by a staff
of psychological assessors. These ratings are based on comprehensive
psychological study and provide a reasonable criterion to test the statement
that persons with certain scores tend to be seen in certain ways. The
correlations between CPI scores and the ratings to which they supposedly
relate range from .21 to .48. Such modest correlations warn against depending
on any single aspect of the description from the CPI.

We may not, however, judge a description of a whole personality by the
validities of the scales taken separately. The personality description covers
many dimensions, and a little information about each feature will add up to
a ... portrait. Moreover, considering the ... possibly permits much more
accurate ... than the single-scale interpretations for which Gough gives
validity ... that the interpretation of one score depends upon the level of
another ... is shown in Figure 83. The interpretation might be further
modified if ... were taken into account.
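
To see why correlations of .21 to .48 count as modest, it helps to square
them: under the usual linear interpretation, the square of a correlation is
the proportion of variance in one measure that is associated with the other.
The short sketch below applies this to the two ends of the range quoted from
Gough; the function name is introduced here for illustration.

```python
def shared_variance(r):
    """Proportion of variance two measures share, given their correlation."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation must lie between -1 and 1")
    return r ** 2


lowest = shared_variance(0.21)   # weakest scale-to-rating relation reported
highest = shared_variance(0.48)  # strongest scale-to-rating relation reported
```

Even the strongest of these relations leaves more than three-quarters of the
variance in the assessors' ratings unaccounted for by the score.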
Particularly needed, at this point, is evidence not now available. Using
single scales, pairs of scales, or whole profiles, the interpreter should
divide cases into three piles according to whether he regards them as
strikingly high, strikingly low, or in the middle range on some trait (e.g.,
compliant). This classification could be correlated with ratings by others
who know the person well.
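
The check proposed here can be sketched as follows. In practice the piles
would come from the interpreter's own judgment of each case; the fixed cutoffs
below merely stand in for that judgment, and the scores, ratings, cutoffs, and
function names are all hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def to_pile(score, low_cut, high_cut):
    """Code a case as -1 (strikingly low), 0 (middle), or 1 (strikingly high)."""
    if score <= low_cut:
        return -1
    if score >= high_cut:
        return 1
    return 0


# Hypothetical scale scores for ten persons, and ratings of the same trait
# made by acquaintances who know each person well.
scores = [34, 41, 48, 50, 52, 55, 59, 63, 68, 71]
ratings = [2, 3, 3, 4, 3, 4, 4, 5, 5, 6]

piles = [to_pile(s, low_cut=40, high_cut=60) for s in scores]
validity = pearson_r(piles, ratings)
```

The resulting coefficient tells how far the interpreter's three-way sorting
agrees with the judgment of people who know the subjects, which is the
evidence said above to be lacking.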

From the evidence now available (see also p. 592) we must continue to
regard descriptive interpretations as hazardous. Those familiar with a
particular test often believe that it gives them clear pictures of
personality. This may be a self-delusion nurtured by recall of successful
cases, but one cannot deny that many inventories measure individual
differences reliably and that those differences have some relation to
personality as observed in other ways. When the description from the test is
a point of departure for further study of the individual, errors of
interpretation can be corrected. Under no circumstances should such a
description be passed on to a school principal, an employer, or any other
decision maker not trained to check the interpretation critically against
other evidence.


                         Ac high
        compliant                    efficient
        industrious                  mature
        moderate                     organized
        quiet                        stable
Ai low                                              Ai high
        awkward                      demanding
        coarse                       dominant
        self-defensive               independent
        shallow                      sharp-witted
                         Ac low


FIG. 83. Proposed pattern interpretation of CPI scores labeled Achievement
through conformity (Ac) and Achievement through independence (Ai) (Gough,
1957).
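
The pattern interpretation of Figure 83 amounts to a two-way lookup: the
adjectives suggested depend jointly on whether Ac and Ai are high or low. A
minimal sketch follows. The adjectives are those shown in the figure, with
their assignment to quadrants read from the figure's layout; the cutoff of 50
separating "high" from "low" is an assumption made only for illustration and
is not given by Gough.

```python
# Keys are (Ac level, Ai level); adjectives are those listed in Figure 83.
FIGURE_83 = {
    ("high", "low"): ["compliant", "industrious", "moderate", "quiet"],
    ("high", "high"): ["efficient", "mature", "organized", "stable"],
    ("low", "low"): ["awkward", "coarse", "self-defensive", "shallow"],
    ("low", "high"): ["demanding", "dominant", "independent", "sharp-witted"],
}


def level(score, cutoff):
    """Label a score 'high' if it reaches the cutoff, otherwise 'low'."""
    return "high" if score >= cutoff else "low"


def pattern_description(ac_score, ai_score, cutoff=50.0):
    """Adjectives proposed for a given combination of Ac and Ai scores.

    The cutoff is hypothetical; the figure itself names no numerical value.
    """
    return FIGURE_83[(level(ac_score, cutoff), level(ai_score, cutoff))]


# Same Ac score, different Ai scores: the suggested description changes.
conforming = pattern_description(ac_score=62, ai_score=41)
balanced = pattern_description(ac_score=62, ai_score=58)
```

The two calls show the point of the figure: an identical Ac score is read as
"compliant" or as "efficient" depending on the level of Ai. As the text warns,
such a description is a starting point for further study, not a finding.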


23. Grayson diagnosed his case as “incipient paranoid schizophrenia.” What
diagnosis would be suggested from the peak MMPI scores if no effort had been
made to interpret the dynamics of the personality as a whole?


Establishment of Scientific Laws 


A recent development is the employment of personality measures in the
establishment of psychological theory. Test scores are interpreted in terms of
theoretical concepts and related to behavior under various experimental
conditions. The outcome of such experiments is, first, interpretation of the
test in terms of a refined concept rather than an ambiguous or arbitrary
trait name and, second, development of theory as to the significance of the
trait. In addition to the study of Taylor scores and eyelid conditioning
summarized earlier, we may mention the more elaborate study by Cervin (1951)
testing the hypothesis that in a two-person discussion the more emotionally
responsive subject will take more initiative, participate more, and change
opinion less. In order to bring out this effect, he found it necessary to use
a specially purified measure of emotional responsiveness (akin to anxiety or
neuroticism) rather than a published questionnaire. In experimental groups
formed by pairing high scorers with low scorers, the predicted differences
were found in about 80 percent of the cases.

Relations of this type can be considered well established only when
confirmed by other investigators. As more such relations are verified and
woven into psychological theory, tests will come to have an important role in
the science of psychology as well as in practical decisions. Theoretical
clarification should also have practical consequences, though these may be
far in the future.


REPRESENTATIVE PERSONALITY INVENTORIES 


The following inventories illustrate the variety among currently published
inventories, but by no means exhaust the field:

• Billett-Starr Youth Problems Inventory; Roy O. Billett and Irving S.
Starr; World Book, 1958. Grades 7-9, 10-12. A problem checklist covering
such areas as health, boy-girl relationships, personal finance, and planning
for the future. Designed for general screening of pupils for individual
study, and for identification of common problems to be taken up in group
guidance.


• California Psychological Inventory; Harrison G. Gough; Consulting
Psychologists Press, 1957. High school. A lengthy inventory covering fifteen
traits such as sociability, tolerance, and intellectual efficiency, plus three
control keys. The scoring keys were developed empirically but have rather low
correlations with their criteria. Interpretation is based primarily on an
impressionistic psychological integration of the entire profile. The profile
covers personality more broadly than most other inventories, but scores often
intercorrelate too highly for efficient measurement. Interpretation has not
been adequately standardized and validated.
• California Test of Personality; Louis P. Thorpe, Willis W. Clark, Ernest
W. Tiegs; California Test Bureau, 1942, 1953. Primary, elementary, secondary,
and adult forms. A questionnaire yielding percentile scores on personal
adjustment and social adjustment. Such subscores as “sense of personal
worth,” “nervous symptoms,” and “family relations” have skewed distributions
and are capable of giving meaningful information about patterns in adjustment
only in rare cases. The evidence on validity presented in the manual is
incomplete, and misleading in places. The manual attempts to summarize theory
and practices of mental hygiene for teachers, and the necessarily brief
presentation runs considerable risk of being misinterpreted and misapplied.

• Edwards Personal Preference Schedule; Allen L. Edwards; Psychological
Corporation, 1954, 1959. High school, adult. The 225 paired comparisons
lead to scores on 15 “needs.” Designed so as to eliminate “façade” effects,
but patterns can be faked. The scale is recent and research is limited. Gives
a description likely to be helpful in counseling.

• Gordon Personal Profile and Gordon Personal Inventory; Leonard V.
Gordon; World Book, 1953, 1956. High school to adult. Uses eighteen to
twenty forced-choice items in each form. The Profile measures ascendancy,
responsibility, emotional stability, sociability; the Inventory measures
cautiousness, original thinking, personal relations, vigor. Either can be
given in fifteen minutes, yet reliability of scores is about .83. An efficient
instrument for obtaining a self-description profile. Evidence regarding
significance of scores is extremely limited but encouraging.

• Guilford-Zimmerman Temperament Survey; J. P. Guilford and Wayne
Zimmerman; Sheridan Supply Company, 1949. Adolescent and adult.
Measures ten relatively independent traits defined through factor analysis,
including ascendance, sociability, thoughtfulness, objectivity, and restraint.
A more efficient version of the earlier Guilford scales. A typical descriptive
instrument. Little evidence on significance of scores is available.


• Kuder Preference Record, Form A—Personal; G. Frederic Kuder;
Science Research Associates, 1948, 1953. Adolescent and adult. A companion
to the vocational interest inventory, this set of forced-choice items measures
preference among sociable, intellectual, etc., activities. (See Figure 81.)
Occupational patterns have been collected which enhance the usefulness of
the scale in vocational guidance, but little is known about the scale as a
descriptive or diagnostic instrument. It is free from facade effect, but
patterns can be faked. Since scores have no obvious “good-bad” implications,
the Kuder is likely to be suitable for introducing counseling, especially in
high school.


• Minnesota Counseling Inventory; Ralph F. Berdie and Wilbur L. Layton;
Psychological Corporation, 1957. High school. Many of the 413 true-false
self-description statements are rewritten MMPI items. Seven scales measure
adjustment to family and social relations, emotional stability, mood,
conformity, etc. Two control keys are provided. The scales have positive but
very modest validity for separating (for example) pupils known to have
poor family adjustment from those rated as having good adjustment. Retests
after ... three months show reliabilities in the .70-.80 range.
Interpretation of this instrument will remain uncertain until considerably
greater experience and validating evidence have been accumulated.

• Minnesota Multiphasic Personality Inventory; S. R. Hathaway and J. C.
McKinley; Psychological Corporation, 1943, 1951. Late adolescents and
adults. (See pp. 469 ff.)

• Mooney Problem Check Lists; Ross L. Mooney and Leonard V. Gordon;
Psychological Corporation, 1948, 1950. Forms for junior high through college,
and adult. Subject checks his problems in eleven fields: morals and
religion, finances and living conditions, adjustment to school work, social
relations, etc. High scores identify those who should receive counseling, and
items checked provide a basis for individual or class discussion.

• SRA Youth Inventory; H. H. Remmers and Benjamin Shimberg; Science
Research Associates, 1949. High school; also, SRA Junior Inventory for
Grades 4-8, 1955. A checklist of unusually efficient format covering typical
adolescent problems of educational and vocational planning, and social and
emotional adjustment. Chiefly useful as a starting point for individual and
group guidance; may also be used as a screening inventory to detect
individuals requiring intensive study.

• The 16 P. F. Test; R. B. Cattell, D. R. Saunders, and Glen Stice;
Institute for Personality and Ability Testing, 1950. Age 16 and over. Sixteen
scores measure dimensions such as dominance, general intelligence, emotional
stability, radicalism, and will control. The dimensions are relatively
independent and have some advantages for research purposes. The short
scales have extremely low reliability (.45-.55) and the information on norms
is unsatisfactory. Not recommended for assessment of individuals. (Versions
of the test for various school ages are either available or in preparation.)

• Study of Values; Gordon W. Allport, Philip E. Vernon, Gardner Lindzey;
Houghton Mifflin, 1931, 1951. Later adolescence and college. Forced choice
between preferred ... and beliefs. Scored according to Spranger's system
to indicate relative emphasis on Theoretical, Economic, Political,
Aesthetic, Social, and Religious values. Of some value as a supplement to
interest inventories in vocational guidance; much used for research in social
psychology.

• Survey of Study Habits and Attitudes; Wm. F. Brown and Wayne H.
Holtzman; Psychological Corporation, 1953. College students. Covers study
behavior and attitudes (e.g., “Whether I like a course or not, I still work
hard to make a good grade”). Out of 75 items on this questionnaire about half
are scored; these distinguish students with good marks from those who do
poorly. The score correlates about .45 with grades and, combined with an
ability test, yields a predictive validity of about .60. The test is fakable
and is not recommended as an admission test. Both the total score and the item
responses are useful in counseling and in how-to-study courses.

Three levels of test were mentioned in Chapter 1: Level A, appropriate for
use by teachers and others without special training in testing; Level B, for
use by counselors and others with a good general understanding of testing;
and Level C, for use by persons with considerable psychological training
and relevant supervised experience. Most personality tests belong to the
higher levels, because their interpretation requires considerable ...
There is some risk that the interpretation will create difficulties which only
a professional counselor or clinical psychologist is likely to recognize. For
example, to tell a subject that he is low in emotional stability may aggravate
his difficulties. Giving the same facts to his employer, no matter how
cautiously presented, may blight his chances of promotion and ultimately
increase his maladjustment. Allowing a test to damage the person's
opportunities and self-satisfaction would be dubious even if the test were
highly valid. Since it is not, the results of personality tests should
generally be given only to professional workers who know their limitations.
In the light of these dangers, the author suggests the following
categorization of the tests listed above:


A. Can safely be interpreted by teachers:

• Mooney, SRA, and Billett-Starr inventories. Teachers should not attempt
to analyze individual test scores, and unless a counselor is to interpret
individual records pupils’ answer sheets should probably be unsigned. A
tabulation of the frequency of particular problems is an excellent basis for
group guidance, curriculum planning, and modification of school conditions
which create problems.


B. Can safely be used by the counselor with basic training in vocational
and educational counseling.

• Allport-Vernon-Lindzey, Kuder Personal. These inventories reflect the
person's preferred choices and in that respect resemble interest inventories.
Interpreting the scores is unlikely to threaten self-esteem.

• Billett-Starr, Mooney, SRA inventories. These may be used to identify
pupils for interviewing, and as a starting point for interviewing. Little
attention should be paid to the scores themselves. The counselor should not
attempt to resolve deep emotional conflicts that call for experience he does
not have.


B’. Can safely be used by counselors with considerable training in
personality theory and handling of emotional conflict.

• Bell, California (CTP), Edwards, and Gordon inventories. These
should ordinarily be interpreted as a part of an individual case study; except
in research studies it is rarely advisable to apply them routinely to ...
Since some scores in the Bell, CTP, and Gordon may threaten the individual,
the counselor should consider carefully before deciding whether to interpret
the test to the subject.

C. Require comprehensive training in counseling psychology or clinical
psychology, including understanding of test theory, personality theory, and
handling of emotional conflict.

• CPI, Guilford-Zimmerman, Minnesota Counseling Inventory, MMPI.
Further reports of research on the meanings of profiles on some of these
inventories may ultimately make them trustworthy in the hands of counselors
at Level B or B’. Most promising in this respect is the relatively simple
MCI. This instrument is not difficult to interpret, but its use in schools
carries some risk. The Leadership score, for example, is not very valid; it
may be that if such a score is made available to teachers they will give
leadership opportunities to pupils with high scores and deprive pupils with
low scores of this valuable learning experience.

24. A Leadership score identifies pupils whose responses resemble those of
other pupils who have become leaders. What characteristics other than
leadership ability and interest are likely to distinguish student leaders in
high school from the students who take little part in student affairs?
25. If school officials make use of leadership scores in encouraging certain
pupils to take leadership responsibilities, will this tend to increase or
decrease the correlation between the original scores and leadership record by
the end of high school?
26. Scores on certain instruments purport to identify students likely to be
troublemakers and potential delinquents. Assuming that such a score has very
high stability and validity, what use might be made of such a test by high
schools? If, as is the case, the validity coefficients are quite low, what
undesirable effects may follow if such scores are collected by principals?
27. The restrictions on use of personality inventories in counseling suggested
above are admittedly conservative. Some psychologists argue that it is unwise
for counselors to “imitate the secrecy of the medical profession” in
withholding scores from teachers and other laymen. These psychologists argue
that laymen continually make judgments about personality, and that if
discouraged from using test scores they will base their judgments on casual
observations of even less validity than the tests. What do you think?

IDIOGRAPHIC ANALYSIS OF THE SINGLE PERSONALITY


Criticisms of the Concept of “Trait”
If a test is to assign a rank or score to the individual, there must be a
characteristic or dimension along which this score is located. The desire for
scales analogous . . . in a defined way in response to a defined class of
stimuli. Traits are deeply embedded in Western languages; nearly all the
adjectives which apply to people are descriptive of traits: happy,
conventional, stubborn, and so on. Traits are elusive in scientific analysis,
however, and are defined and measured only at the risk of some ambiguity.


So 
The postulate that traits exist is supported by three facts:

• Personalities possess considerable consistency; persons show the same
habitual reactions over a wide range of similar situations.

• For any habit, we can find among people a variation of degrees or
amounts of this behavior.

• Personalities have some stability, since the person earning a certain
score this year usually has a somewhat similar score next year.

These facts lead one to consider personality traits as habits, capable of
being evoked by a wide range of situations. It would be tedious to list a
series of traits such as “habit of bowing politely when meeting a pretty
woman of one's own age on the street on Sunday,” “habit of bowing politely
when meeting a not-pretty woman . . . ,” etc. Therefore we seek traits
which describe consistent behavior in a wide range of situations. The trait
approach to personality hopes to describe economically the significant
variations of behavior, neglecting unduly specific habits. Since the English
dictionary offers no less than 17,953 adjectives describing traits, the problem
of economy is a serious one.

A trait is a composite of many specific behay 


fectly honest predicts his behavior in 
narily, ; i 


2 » problem of 
jectives describing traits, the prob 


iors. To say that a boy pes 
any situation involving oi onest 
rmediate degree of a trait; he is 5 not 
. Two people with the same score nee 


» implies an 
ty. Saying that a boy is “50 percent honest” implie 
honest or dishonest beh 


: same 

ollected under the trait definition are present in the 

person, i.e., when all Scores are 100 Percent or zero. ent. 
“The normal Personality” presents tr 


oublesome problems of ne 
istribution is well characterize betae 
degree and in a large number 0 orma 
ores tell the investigator little. Yet every mal" in 
que characteristics. Even a person who is “normé to 4 
ure has individuality, Reducing his performance 


t- 

ional, we lose P^ 

Ves not to be exceptional, 
different from his also- 


his 
The deviate at either 

tions. Intermediate se 
personality has its unj 
all the traits we meas 


criticized the entire trait approac® 
need; he will take money to feed Bs he wW! 
T may be prudent rather than yc aet 
d being caught. Another may define erate ? 
Ver steal, but he thinks it right to Rr to a” 
uyer beware.” These men are all vc 

ally, their honesty is near the average 


to avoi 


uld ne 
Principle of “h 
intermediate degree; Statistic, 
Since mapping a personality in terms of a few common traits does not
represent the way the individual's behavior is organized, many investigators
have tried to develop what Allport calls an “idiographic” description. An
idiographic analysis would define new traits as needed to fit each individual
(e.g., "shy with women of his own age in non-business relationships"). The
difficulties that face such efforts are enormous, but some initial steps have
been taken successfully.

The trait approach describes responses as if they were general over a very 
large class of situations. “Dominant,” “paranoid-like,” and “honest” describe 
responses independent of particular situations. The idiographic approach 
looks for equivalences among situations. Sometimes student X shows 
dominance, sometimes not. If we can find out what situations bring out 
dominant reactions—i.e., are equivalent for him—we can then hope to pre- 
dict his behavior with some exactness. 

As a first step in studying situational equivalences, C. E. Osgood and
G. A. Kelly have developed techniques for studying perception of the
significant persons in a subject's life. These others are an important part of the
person's world, and many reactions are determined by his perception of
them.


28. Show how "stubbornness" might be present in some situations and absent in
others for the same person, even though both actions are typical for him.


The Semantic Differential


Osgood's method was developed for research on perception, meaning, and
attitudes, rather than as a personality test (Osgood et al., 1957). Known as
the Semantic Differential, it measures indirectly the connotations of words
or objects. The stimulus is rated on a seven-point scale, various scales and
stimuli being mixed in random order. Successive items might appear as
follows:


MY FATHER    soft  ___:___:___:___:___:___:___  hard
FRAUD        rich  ___:___:___:___:___:___:___  poor
CONFUSION    fair  ___:___:___:___:___:___:___  unfair
MY FATHER    deep  ___:___:___:___:___:___:___  shallow


In most studies Osgood and his students have been interested in specific
stimuli (e.g., “physicians,” “Presidential candidate A”) as perceived by a
group. For examining an individual, Osgood employs stimuli of personal
significance, e.g., "my father."
The subject is to check the scale rapidly, recording his first impressions.
Naturally it is difficult to defend any single response as right when judging
Communism on the scale thin-thick, but subjects have little difficulty in
checking associations. The scoring can be accomplished in two ways. Using
factor analysis, the scales can be grouped into good-bad, strong-weak, and
active-passive keys. Average scores can be assigned for each stimulus. Thus
we could say that a subject has indirectly described his father at +1 on
good (on a scale from +3 to -3), 2.4 on strong, -0.4 on active. The other
scoring method compares stimuli two at a time, converting the differences
between their ratings into a "distance score" measuring the degree to which
the subject perceives the stimuli as similar.
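
The two scoring methods just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assignment of scales to keys, the ratings, and the function names are all invented for the example, and the distance score is taken here to be the root of the summed squared rating differences, which is one common way to turn rating differences into a single distance.

```python
from math import sqrt
from statistics import mean

# Hypothetical assignment of scales to the three keys; in practice the
# grouping would come from a factor analysis of many subjects' ratings.
KEYS = {
    "good": ["good-bad", "fair-unfair"],
    "strong": ["strong-weak", "hard-soft"],
    "active": ["active-passive", "fast-slow"],
}


def key_scores(ratings):
    """Average one stimulus's ratings (-3..+3) over the scales in each key."""
    return {key: mean(ratings[s] for s in scales) for key, scales in KEYS.items()}


def distance(a, b):
    """Root of the summed squared rating differences between two stimuli."""
    return sqrt(sum((a[s] - b[s]) ** 2 for s in a))


# Invented ratings; +3 means the subject checked the first-named pole.
my_father = {"good-bad": 1, "fair-unfair": 1, "strong-weak": 3,
             "hard-soft": 2, "active-passive": 0, "fast-slow": -1}
me = {"good-bad": 1, "fair-unfair": 2, "strong-weak": -1,
      "hard-soft": -2, "active-passive": 0, "fast-slow": -1}

print(key_scores(my_father))   # one average per key
print(round(distance(my_father, me), 2))  # smaller means "perceived as more alike"
```

A small distance between two stimuli indicates that the subject rates them alike across the scales; a large one indicates that he perceives them as quite different.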

The best illustration of the technique is its application to a case of triple 
personality. A dissociated personality is one in which the person possesses 
two or more different "selves" and shifts back and forth between them (a bit 
like Dr. Jekyll and Mr. Hyde). Eve White had three such identities who 
"took possession" at various times, and her therapists were able to administer 
the Semantic Differential to each self in turn (Thigpen and Cleckley, 1953,
1957). In Figure 84 we present the configurations from two of the tests. The
black ball represents the midpoint on all scales. “Good” is at the top,
“active” at the left, and “weak” toward the viewer. The solid line connecting
the black ball with “doctor” (who is always good, strong, very active) helps to
orient the figure.

FIG. 84. Systems of Eve White and Eve Black on the Semantic Differential
(Osgood and Luria, 1954).


Two psychologists interpreted the patterns “blindly,” i.e., with no prior
knowledge of the cases. Looking first at a few salient indicators, they point
out in Eve White's record the separation of love and sex, the
meaninglessness of the spouse, the weakness of “me.” Eve Black seems to place hatred
and fraud in a favorable cluster with “me” and rejects spouse, love, job, and
child. (The pattern of the third self, Jane, which will not be discussed here,
is normal; love and sex are closely linked and favorable.) An impressionistic
"guess" by the interpreter led to this summary of the first two personalities
(Osgood and Luria, 1954):


Eve White is the woman who is simultaneously most in contact with social
reality and under the greatest emotional stress. She is aware of both the demands of
society and her own inadequacies in meeting them. She sees herself as a passive
weakling and is also consciously aware of the discord in her sexual life, drawing
increasingly sharp distinctions between love as an idealized notion and sex as a
crude reality. She maintains the greatest diversity among the meanings of various
concepts. She is concerned and ambivalent about her child, but apparently is not
aware of her own ambivalent attitudes toward her mother. . . . Those
psychoanalytically inclined may wish to identify Eve White with dominance of the
superego; certainly, the superego seems to view the world through the eyes of Eve
White, accepting the mores or values of others (particularly her mother) but
continuously criticizing and punishing herself. . . .

Eve Black is clearly the most out of contact with social reality and
simultaneously the most self-assured. To rhapsodize, Eve Black finds Peace of mind through
close identification with a God-like therapist (My Doctor, probably a father symbol
for her), accepting her Hatred and Fraud as perfectly legitimate aspects of the
God-like role. Naturally, she sees herself as a dominant, active wonder-woman and
is in no way self-critical. She is probably unaware of her family situation. . . .
Like a completely selfish infant, this personality is entirely oriented around the
assumption of its own perfection.


The pattern corresponds well with the therapists’ picture of Eve. The
therapists described the same personalities in these phrases, among others:


Eve White: demure, almost saintly, seldom lively; tries not to blame her husband
for marital troubles; every act demonstrates sacrifice for her little girl; meek,
fragile, doomed to be overcome.

Eve Black: a party girl, shrewd, egocentric; rowdy wit; all attitudes whimlike;
ready for any little irresponsible adventure; provocative; strangely secure from
inner aspect of grief and tragedy.

The correspondence of the portraits is remarkable. A single brilliant hit,
however, is not to be regarded as adequate evidence of validity.
Mapping stimulus equivalences gives a different type of information from
that of the trait-oriented questionnaire, but the semantic map also gives
information about traits. Eve White is unquestionably dissatisfied with
herself, perfectionist, unwilling to express emotion; any questionnaire would
show a high score on introversion and hysteric tendencies. The Semantic
Differential adds information about specific sources of conflict: lack of
acceptance of spouse and sex, and her child's weakness and need for
protection. (Indeed, Mrs. White's feeling that she could not give her child
adequate protection was a precipitating cause of her illness.) Eve Black is
shallow, uncontrolled, self-centered: extrovert on any questionnaire and on
MMPI surely extreme on Ma and Pd. This is apparent in the Semantic
Differential association of me, hatred, and fraud as "good and strong." But the
map gives the additional picture of strong identification with men and
rejection of child and spouse. One can judge what persons are
likely to win her respect and coöperation, and what rewards she is likely
to work for. Such information goes far beyond what one can get from even
the most valid description of her general personality style. The rewards
a therapist might ordinarily offer (opportunity to hold a job, preservation
of marriage) were spurned by Eve Black. The coöperation that eventually
permitted some success in therapy was won only when the therapist
appealed to Eve Black’s fear of sickness.

The Role Concept Repertory (Rep) Test of G. A. Kelly (1955) is much
like Osgood’s procedure, save that the subject himself now picks the scales
on which he will respond. This device reflects Kelly's theory of social
behavior and psychotherapy, which places great stress on the way the person's
conceptualizations shape his behavior. The principal aim of the Rep test is
to obtain information useful to a therapist.


The subject is given a list of about twenty roles, of which the following
are representative:


Your wife or present girl friend
Your mother
A person with whom you have worked who was easy to get along with
A girl you did not like when you were in high school
The person whom you would most like to be of help to (or whom you feel most
sorry for)


The subject names the people who fill these roles for him. The examiner then
selects three of the persons and asks, “In what important way are two of
them alike but different from the third?” If the response is superficial
(“These two are tall”) the examiner asks for some further similarity. A better
response might be, “These two are self-confident and this one is shy.” The
subject has then stated a bipolar scale along which he perceives persons to
differ. The procedure is continued until many scales have been elicited
and applied to the significant others.


29. Is the Semantic Differential fakable? Can one argue that it assesses unconscious
attitudes?

30. How might the Semantic Differential be used to study transference relations
during psychotherapy?

31. Osgood finds only three predominant factors among his scales. Do three
dimensions appear adequate to describe one's perception of others?

32. Is Osgood's test primarily behavioristic or phenomenological in outlook?




Suggested Readings


Diamond, Solomon. The factorial approach. Personality and temperament. New
York: Harper, 1957. Pp. 151-183.
This review attempts to classify dimensions of personality by factor analysis,
considers lists of possibly important traits, and also gives an introductory
explanation of factor-analytic procedures.

Hathaway, Starke R., & Monachesi, Elio D. Personality characteristics of
adolescents as related to their later careers. II. Two-year follow-up on
delinquency. Analyzing and predicting juvenile delinquency on the MMPI.
Minneapolis: University of Minnesota Press, 1953. Pp. 109-135.
This massive study of score patterns indicative of delinquency shows both
the advantages and the disadvantages of analyzing combinations of scores.
The first and last chapters of the book deal with the practical meaning of the
research.

Meehl, Paul E., & Hathaway, Starke R. The K factor as a suppressor variable
in the MMPI. J. appl. Psychol., 1946, 30, 525-564. (Reprinted in G. S. Welsh
and W. C. Dahlstrom (eds.), Basic readings on the MMPI in psychology and
medicine. Minneapolis: University of Minnesota Press, 1956. Pp. 12-40.)
Several theoretical aspects of the development of MMPI are discussed,
including the authors’ reasons for not forming homogeneous clusters of items
and the need for corrections for bias in self-reports.

Schiele, B. C., & Brozek, Jozef. "Experimental neurosis" resulting from
semi-starvation in man. Psychosom. Med., 1948, 10, 31-50. (Reprinted in Welsh and
Dahlstrom, op. cit. Pp. 461-483.)
In a study of MMPI changes during an experimental stress of six months'
duration, nine cases are described in detail, showing the relationship
between MMPI profiles and behavior patterns.

Ullmann, Charles A. Teachers, peers, and tests as predictors of adjustment. J.
educ. Psychol., 1957, 48, 257-267.
Evidence is given on the difference between information contained in teacher
ratings and in self-report questionnaires. Attention focuses on pupils who drop
out of school or who perform poorly in school. Items capable of identifying
such students are listed.


Judgments and Systematic Observations


WHETHER an individual's reputation corresponds to his behavior or not, it
is unquestionably significant. A person who has impressed his teachers
as imaginative is favored by a college admissions committee. Industrial
and military organizations file supervisors' opinions and use them in choosing
whom to promote. Teachers find out what children think of each other
in order to understand relationships in the classroom and to identify
misfits. Furthermore, as we have seen, ratings are an important criterion
for studying job performance and adjustment. In this chapter we shall
consider problems and techniques of obtaining ratings by superiors and by
peers (companions at the same level in the organization). We shall then
turn to systematic observations of behavior.


RATINGS AND SOCIOMETRIC REPORTS

Ratings by Supervisors

Descriptions by supervisors (foremen, teachers, superior officers, etc.)
are hard to compare because styles of writing vary. Rating scales are
therefore used to reduce impressions to manageable form. A rating scale consists
of a list of traits to be rated. The form of the scale may vary from a simple
list of adjectives to be checked to a continuous scale with several
descriptive labels, as illustrated in Figure 85. Before evaluating specific forms of
rating scale, let us consider the chief difficulties to be overcome.

Sources of Error. The first problem is generosity error, i.e., the tendency of
raters to give favorable reports. The teacher, asked to indicate on a report
card whether the pupil is coöperative, will usually rate all but the most
troublesome pupils at the highest point on the scale. Company commanders
rate 98 percent of their junior officers in the top two categories (out of five)


Name of student ____________

(For each item the rater checks one alternative and is asked: "Please record
here instances on which you base your judgment.")

A. How are you and others affected by his appearance and manner?
   [ ] Sought by others
   [ ] Well liked by others
   [ ] Liked by others
   [ ] Tolerated by others
   [ ] Avoided by others
   [ ] No opportunity to observe

B. Does he need frequent prodding or does he go ahead without being told?
   [ ] Seeks and sets for himself additional tasks
   [ ] Completes suggested supplementary work
   [ ] Does ordinary assignments of his own accord
   [ ] Needs occasional prodding
   [ ] Needs much prodding in doing ordinary assignments
   [ ] No opportunity to observe

C. Does he get others to do what he wishes?
   [ ] Displays marked ability to lead his fellows; makes things go
   [ ] Sometimes leads in important affairs
   [ ] Sometimes leads in minor affairs
   [ ] Lets others take lead
   [ ] Probably unable to lead his fellows
   [ ] No opportunity to observe

D. How does he control his emotions?
   [ ] Unusual balance of responsiveness and control
   [ ] Well balanced
   [ ] Usually well balanced
   [ ] Tends to be unresponsive      [ ] Tends to be over emotional
   [ ] Unresponsive, apathetic       [ ] Too easily depressed, irritated or elated
   [ ] No opportunity to observe

E. Has he a program with definite purposes in terms of which he distributes
   his time and energy?
   [ ] Engrossed in realizing well formulated objectives
   [ ] Directs energies effectively with fairly definite program
   [ ] Has vaguely formed objectives
   [ ] Aims just to "get by"
   [ ] Aimless trifler
   [ ] No opportunity to observe


FIG. 85. The ACE Personality Report, Form B. (Reproduced by permission of
American Council on Education.)
on efficiency reports. Such ratings have little value because they fail to
discriminate between individuals. There are several reasons for generosity
errors: the rater may feel that he is admitting poor leadership if he reports that
his subordinates are not performing well; he tends to feel kindly toward his
associates; he thinks he may have to justify any implied criticism; and he
often finds it easier to say good about everyone than to pause and make
careful discriminations.
Ambiguity is a second difficulty. Just as a self-report question about
leadership can be variously interpreted, so a rater may define leadership in various
ways. To one judge “leadership” suggests conscious direction,
crisp decisions, and general dominance. A person rated high by this judge
would receive a lower rating from a judge who looks for a leader to
encourage subordinates, bring out coöperative decisions, and subordinate his own
views to the decision of the group.
The rater is usually instructed to mark one of several alternative response
positions, and these response positions may also be ambiguous. In some of the
early rating scales the respondent was asked to rate a trait on a scale from
0 to 100, for example. No particular definition can be given for a
number such as 85 on that scale, and the same score may indicate quite
different behavior to different raters. Such words as average and excellent are
equally indefinite. They should be replaced by specific descriptions of
behavior.


Judges have constant errors or biases. A constant error can be uncovered
when two judges rate the same individuals. If the judges' averages differ,
they are observing different aspects of behavior or are defining the scale
differently. Generosity is one such constant error. The response styles
mentioned in connection with achievement tests and personality inventories
are also observed in ratings; e.g., one judge rarely uses the extremes of the
scale in describing subjects, whereas another describes most persons in
black-and-white terms.
A further source of differences between judges is that each has only limited
information about the individual. Since a physical education teacher and an
English teacher see entirely different sides of the student, their ratings on
initiative, imagination, or reaction to frustration will disagree. Even when an
observer sees an individual in a great variety of situations, his sample of
behavior is still limited. The supervisor can base his ratings only on what
the man does under his supervision, and this may not be at all representative
of his work elsewhere.
The so-called halo effect is an error which obscures the pattern of traits
within the individual. The observer forms a general opinion of the person's
merit, and his ratings on specific traits are strongly influenced by this
overall impression. Even productivity may be rated erroneously owing to
the influence of a pleasing or displeasing personality. Halo is responsible for
the substantial correlations shown in Table 66 among ratings given to 1100
industrial employees. The ratings on quite dissimilar traits show a marked
general factor, apparently corresponding to the foreman's opinion of the
man's industriousness and productivity.


TABLE 66. Intercorrelations of Ratings Given 1100 Industrial Workers

Traits rated include: Safety, Knowledge of job, Accuracy, Productivity,
Overall job performance, Industriousness, Initiative, Judgment, Coöperation,
Personality, Health.

Diagonal figures show correlations of two raters’ judgments on each worker,
i.e., reliability of judging.
SOURCE: Ewart et al., 1941.

These sources of error have four undesirable consequences:

• Ratings may not reveal important individual differences because they
pile up at the favorable end of the scale.

• Ratings may be seriously invalid, representing chance effects or traits
other than the one supposedly rated. Judges . . . of observation, for example,
overrated the ability of more . . . , less outgoing personalities (Barron, 1954).

• Halo effect obscures the descriptive picture.

• Ratings by different judges disagree. Evidence of this is found in
Table 66. Reliability of rating is greatest for behaviors which can be clearly
specified and for traits which are descriptive. Traits
reliably rated include talkative, assertive, bashful, and cultured.
Reliability is lowest for general, vaguely stated attributes such as adaptable,
sensitive, and kindly (Hollingworth, 1922; Mays, 1954).

Improvement of Ratings. The problems in improving ratings are similar to
the problems in improving self-reports. Again, the tester must assume that
the respondent will give false information if he thereby gains psychological
rewards. To be sure, the information affects the subject's future rather than
the rater's, but this does not mean that the rater is uninvolved. We have
mentioned the rater's inclination to interpret reports on the subject as a
reflection of the adequacy of his own teaching or supervision. Bias is even more
certain when the therapist's ratings are used as a criterion of personality
change during psychotherapy. Sometimes a rater gives a low rating
because he wishes to retain an employee who might be promoted if he received a
high rating. A teacher who rates a scholarship applicant may exaggerate
his merits nearly to the point of perjury in order to help the student.
Selection of raters is the first point at which to improve ratings. Raters
cannot give valid information unless they know the subject well. Other
things being equal, those in immediate contact with the subject can give
better information than those who rely on hearsay. A high-school teacher
usually can give more dependable information on a pupil's work habits and
social behavior than can the principal.
One elementary precaution, often overlooked in practice, is to include on
the rating blank a question regarding the extent of the rater’s acquaintance
with the subject and the kinds of situation in which observations were made,
and a space where the rater can indicate “insufficient opportunity to
observe” each trait instead of making an estimate from inadequate information.
The American Council scale (Figure 85) not only provides a space to
indicate lack of information but requests specific evidence for each rating so that
the reader can judge for himself whether a favorable rating is justified by
the rater's knowledge. If a judge is directed to mark every trait, some ratings
are little better than guesses. Conrad (1932) directed raters to star traits
which they regarded as especially important in the child's personality.
Inter-judge correlations on all traits ranged from .67 to .82. But for the traits which
three judges agreed in starring, the ratings correlated substantially higher.
When the same judge is used repeatedly, it may be possible to keep a
record of his ratings and ultimately to estimate his constant error. For
example, a college learns to allow for the fact that one high school has a strict
grading or rating policy, whereas another school is lenient. It is rarely
practical to make exact statistical corrections for such differences between judges.

One can raise the reliability of ratings by combining impressions of several
judges. If, as in Table 66, the reliability of a rating is about .45, the average of
two independent judges is expected to have a reliability of .60 and the
average of five judges a reliability of .80. (These results are given by the
Spearman-Brown formula, p. 131.) In the average the bias of one judge tends to
cancel the bias of another, and each adds information the other had no
opportunity to observe. Reliability may be lowered rather than raised, however,
when the additional judges are only remotely acquainted with the subject.
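
The Spearman-Brown projection mentioned above is easy to check numerically. The sketch below assumes the standard form of the formula, in which the reliability of the average of k equally reliable, independent judges is k·r / (1 + (k − 1)·r); the function names are invented for the illustration.

```python
from math import ceil


def spearman_brown(r_single, k):
    """Projected reliability of the average of k judges, each with reliability r_single."""
    return k * r_single / (1 + (k - 1) * r_single)


def judges_needed(r_single, target):
    """Smallest number of judges whose averaged rating reaches the target reliability."""
    return ceil(target * (1 - r_single) / (r_single * (1 - target)))


r = 0.45  # single-judge reliability, as in the example in the text
for k in (1, 2, 5):
    print(k, round(spearman_brown(r, k), 2))

print(judges_needed(r, 0.80))
```

With a single-judge reliability of .45 the projection comes out near .6 for two judges and .8 for five, roughly the figures quoted in the text. The formula presupposes that added judges are as well informed as the first, which is why poorly acquainted judges can lower rather than raise reliability.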

Careful preparation of the rating scale is of great value. There is some
advantage in using several questions dealing with a particular aspect of
personality, just as there is advantage in basing self-report scores on many
related items. On the other hand, anything which enlarges the rater’s task
invites perfunctory answers.

The rater may be asked to make a simple checkmark beside satisfactory
qualities, respond on a numerical scale, or make choices among carefully
described alternatives. Where ratings on each trait are to be considered
separately the last of these forms, known as the descriptive graphic rating scale,
is generally best (cf. Figure 86). The scale is descriptive, since each point
corresponds to a recognizable behavior pattern. It is graphic, in that the rater
is allowed to mark at intermediate points if he does not find any one of the
descriptions entirely suitable. In general, 5- to 7-point scales seem to serve
adequately. With informed and serious professional judges, much finer
subdivisions of the scale prove profitable (Champney and Marshall, 1939).
The 5-point scale obtains more discrimination than the "yes-no" checklist.
A judge will ordinarily say “yes” when asked “Does he have good
judgment?” but if given several alternate choices he may report that the man
overlooks relevant facts in making decisions. The descriptive scale has the
further advantage of drawing attention to various kinds of deviation. The
question, “Does he accept authority?” would not distinguish, as does the
Haggerty-Olson-Wickman scale, between the respectful, obedient child and
the slavish, spiritless conformer.


Is he abstracted or wide awake?
   Continually absorbed in himself (5) | Frequently becomes abstracted (4) |
   Usually present-minded (2) | Wide-awake (1) | Keenly alive and alert (3)

Is he shy or bold in social relationships?
   Painfully self-conscious (4) | Timid, frequently embarrassed (2) |
   Self-conscious on occasions (1) | Confident in himself (3) | Bold (5)

How does he accept authority?
   Defiant (5) | Critical of authority (4) | Ordinarily obedient (3) |
   Respectful, complies by habit ( ) | Entirely resigned, accepts all authority ( )

FIG. 86. Items from the Haggerty-Olson-Wickman Behavior Rating Schedule.
(Copyright 1930 by World Book Company and reproduced by permission.)

To predict an external criterion, traits for rating can be selected empiri- 
cally. The Haggerty-Olson-Wickman scale is intended to screen maladjusted 
pupils for psychological study. A direct interpretation might be made simply 
by scoring socially desirable behavior. The investigators found, however, 
that behavior which on its face seems desirable may actually be a sign of 
maladjustment, appearing more often among problem children than among 
pupils in general. Weights were therefore assigned to each response, as indi- 
cated by the numbers in Figure 86. A score of 1 was given for descriptions 
rarely applied to problem children, and a score of 5 for responses character- 
istic of the problem group. We see, for example, that "Wide-awake" is a fa- 
vorable description, but that "Keenly alive and alert" describes children who 
get into trouble about as often as it does the well adjusted.
This weighting technique suggests the possibility of concealing the scoring 
plan to outwit the rater who is unwilling to give an unfavorable report. One 
might count only those ratings which correlate with the criterion. For example,
in selecting salesmen one might give credit for high ratings such as
energetic, ambitious, and friendly (if these traits correlate with success in the
job) and no credit for equally high ratings such as hard-working, well-adjusted,
and cooperative (if these traits have no predictive value). Indeed,
we may go farther, and assign a negative weight to these irrelevant favorable
ratings to compensate for rater generosity.
Forced-Choice Methods. This idea underlies the forced-choice method of
merit rating pioneered by the military services. Periodic ratings of each
officer by his superior are required for use in promotion and reassignment. The
tradition of giving favorable ratings, however, means that conventional
rating forms bring in almost no information. Psychologists therefore invented the
forced-choice scale. As a first step in making such a scale, superiors are asked
to describe men by checking a list of phrases. A follow-up is then made to
determine which men perform best in subsequent assignments, and for each
adjective or phrase two figures are obtained: a favorability index and a
validity index. A favorable-valid item is one which raters apply frequently and
which predicts success; an unfavorable-valid item is rarely applied and when
applied forecasts failure. Invalid items are those not associated with success
or failure.
The forced-choice item is then developed. One technique used by Army
psychologists employs two pairs of statements. A favorable-valid item is
matched with a favorable-invalid item, and an unfavorable-valid item with
an unfavorable-invalid item. These four were presented together, the rater
being instructed to indicate the one statement which best describes the man
and the one which least describes him. Thus the rater is forced to make at
least one unfavorable statement, and to choose only one favorable statement.
An item might consist of the following alternatives:


   (Favorable-valid)        Wins confidence of his men
   (Unfavorable-invalid)    Inclined to gripe about conditions
   (Favorable-invalid)      Punctual in completing reports
   (Unfavorable-valid)      Has weak tactical judgment

The response is scored by assigning a plus credit for each favorable-valid
choice and a minus credit for each unfavorable-valid choice.

The aim in the forced choice is to separate the rater's task of describing
what the individual does from the task of evaluating what he does (Richardson,
1949). The responsibility for description must rest on the rater, but
evaluation is left to the decision maker.

The score indicates the man's probable merit. For a combat command, it
appears likely that winning confidence is more important than punctuality
in reports, and tactical judgment more important than contentment. The
scoring weights, however, are assigned on the basis of statistical evidence,
not on the basis of judgment. The weights are kept secret from the raters, but
as raters can guess to some extent how the scale will be scored, the
report is only relatively free from distortion.
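
The scoring rule for the sample item can be sketched in a few lines of Python. This is an illustration only, not the Army's scoring procedure: the statements and their categories come from the sample item above, but the function names, the +1 and -1 credits, and the decision to score only the statement marked as best describing the man are assumptions, since the text does not spell out how the "least descriptive" mark enters the score.

```python
# Keyed categories for the sample item given in the text.
ITEM_KEY = {
    "Wins confidence of his men": "favorable-valid",
    "Inclined to gripe about conditions": "unfavorable-invalid",
    "Punctual in completing reports": "favorable-invalid",
    "Has weak tactical judgment": "unfavorable-valid",
}

# Assumed credits: only the valid statements carry any weight.
CREDIT = {
    "favorable-valid": 1,
    "unfavorable-valid": -1,
    "favorable-invalid": 0,
    "unfavorable-invalid": 0,
}


def score_item(most_descriptive, key=ITEM_KEY):
    """Return the credit earned by the statement marked as most descriptive."""
    return CREDIT[key[most_descriptive]]


def score_form(choices, keys):
    """Sum credits over a form; choices[i] is the pick for the item keyed by keys[i]."""
    return sum(score_item(choice, key) for choice, key in zip(choices, keys))
```

Because the two favorable statements look equally flattering to the rater, a generous rater who picks "Punctual in completing reports" earns the man no credit, which is the point of keeping the key hidden.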
Highland and Berkshire (1951) compared several types of forced-choice
instrument for rating instructors. Whereas a graphic rating scale correlated
.40 with rankings, validities for forced-choice scales ranged from .53 to
[...]. The most valid form presented four favorable traits, two relevant to the
criterion and two irrelevant, with the rater instructed to mark the two most
descriptive of the instructor. Such a form can be distorted by a desire to give
favorable ratings, but only to a limited degree. When supervisors filled out the
form a second time with instructions to give as favorable an impression as
possible, the median of the "faked" scores fell near the 67th percentile of the
normal distribution. As Figure 87 shows, the bias raised many scores from
the "bad" end of the scale to the average but did not lead to a piling up of
very high scores. It is evidently possible for the rater to avoid giving a bad
impression on this type of scale, but not to fake a very good one. Raters
preferred the form using all favorable traits to the one using two favorable and
two unfavorable traits. The latter was also more subject to distortion, since
the scores did pile up at the high end of that scale.
Raters are generally antagonistic to forced-choice techniques. They want
to know how their reports will be interpreted and want to be free to give an
entirely favorable impression. Whether a forced-choice scale can be used in
a given situation depends upon the cooperation the data gatherer can anticipate
and upon the authority he can bring to bear. The Army, after developing
the technique and establishing its validity, concluded that resistance from
officers was too great to justify continued use of forced-choice scales in
efficiency reports. It has continued to use forced choice, however, [...]
forms. In industry, forced-choice merit ratings have had considerable
application.

A method of restricting raters which encounters less resistance is to
require rankings. Where large groups of men are to be judged, the instructions
may call, not for complete ranking, but for dividing men into groups such as
top 5 percent, next 20 percent, middle 50 percent, next 20 percent, and bottom
5 percent. This forced distribution obtains more differentiation under
some circumstances than does the graphic scale. Ranking presupposes that
the judge is giving consideration [...] merit will be misleading
when the institution wishes to select men with initiative and [...].
The chief limitation of the ranking method is that groups are rarely
comparable, so that a top man in one group might rank tenth in another.

[Figure 87: two frequency curves, "Normal" and "Faked good," plotted on a
scale running from Unfavorable to Favorable.]

FIG. 87. Distribution of ratings on a forced-choice scale under normal and faking
conditions (Highland and Berkshire, 1951).
The "Q-sort" technique developed by Stephenson (1953) is valuable for
certain purposes. In comprehensive personality assessment, for example,
interviewers and observers may collect a great deal of information and arrive
at a comprehensive picture of the man's strengths and weaknesses. Much
of this information is lost if it is reduced to a few simple numerical ratings.
In a descriptive re[...] Stephenson's method calls for the preparation of a set
of phrases [...] the aspects of personality or performance that concern those
who will use the report. There is no single list of statements for Q-sorting,
since the traits to consider in selecting executives may differ a good deal from
the descriptions useful in appraising patients during psychotherapy. The
following statements are representative of a list used by assessors for
evaluating superior men (Block, 1957):


Communicates ideas clearly and effectively 

Is rigid; inflexible in thought and action 

Takes an ascendant role in his relations with others 
Is masculine in his style and manner of behavior 


Lacks insight into his own motives and behavior
Overcontrols his impulses; is inhibited; needlessly delays or denies gratification 


Allows personal bias, spite, or dogmatism to enter into his judgment of issues 


The statements or phrases are written on separate cards. The rater is told
to sort the cards into eleven piles, with those most descriptive of the subject
in the first pile and those least descriptive in the eleventh. The rater must
place a specified number of items in each pile; if there are 100 statements to
be sorted, he might be told to put them into this distribution:

                   Most descriptive                    Least descriptive
Pile                1   2   3   4   5   6   7   8   9  10  11
Number of cards     2   4   8  11  16  18  16  11   8   4   2


The number of piles and the number of cards differ in different studies. 
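
As a rough illustration of how such a forced distribution might be handled by machine, the Python sketch below checks that a completed sort meets the quota for every pile and correlates the pile numbers from two sorts of the same statements. The quotas are the eleven-pile distribution for 100 statements; the function names, the numbering of statements from 0 to 99, and the use of a Pearson correlation as the index of similarity are assumptions made for the example.

```python
from collections import Counter
from math import sqrt

# Forced distribution for 100 statements in 11 piles (pile 1 = most descriptive).
QUOTAS = [2, 4, 8, 11, 16, 18, 16, 11, 8, 4, 2]


def is_valid_sort(piles, quotas=QUOTAS):
    """piles maps statement id -> pile number (1-based); check every quota is met."""
    counts = Counter(piles.values())
    quotas_met = all(counts.get(i + 1, 0) == q for i, q in enumerate(quotas))
    return quotas_met and len(piles) == sum(quotas)


def sort_correlation(piles_a, piles_b):
    """Pearson correlation between the pile numbers two sorts give the same statements."""
    ids = sorted(piles_a)
    xs = [piles_a[i] for i in ids]
    ys = [piles_b[i] for i in ids]
    n = len(ids)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def example_sort(order):
    """Deal statement ids into piles in the given order, filling each quota in turn."""
    piles, position = {}, 0
    for pile_number, quota in enumerate(QUOTAS, start=1):
        for statement in order[position:position + quota]:
            piles[statement] = pile_number
        position += quota
    return piles
```

Two identical sorts correlate +1.0, and a sort compared with its exact reversal correlates -1.0, because the distribution is symmetric about the middle pile.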


The sorting procedure has some advantage over the usual rating form,
since the rater can shift items back and forth. In the usual inventory or
[...] "Definitely true" may shift while [...] the items placed in [...]
can easily arrange the items so as to [...] (Edwards, 1957). The same method
may also be used in obtaining self-descriptions.

Q-sort data can be handled in several ways. One may compute the median
position of statements representing a single dimension of personality, just as
a test is scored for anxiety or dominance, for instance. One may develop an
empirical key for items predictive of a criterion, as in forced-choice rating
scales. Or one may compute a correlation showing how similar one subject is to
[...] serious criticism (Cronbach and Gleser, 1954). Properly designed Q
statements [...] have unquestionable value for obtaining complex descriptions
which can be systematically compared.
The choice among rating techniques depends upon the purpose of rating,
the qualifications of raters, the information they have about the subject, and
the likelihood of distortion, deliberate or unconscious. The short
but carefully prepared descriptive graphic rating scale is probably best when
each subject is rated by different individuals and one may expect a reasonable
degree of honesty in rating. Ranking is advantageous when a single
judge gives information on the complete group or a representative sample of
the group. The forced choice is often superior when ratings are used for
institutional decisions regarding selection or classification but is less
suitable for guidance or description of the individual. The Q sort is of value
where a comprehensive description of a single individual is desired and the
rater can be expected to give patient consideration to a long list of statements.
Asking the rater to fill out a standard personality questionnaire so that his
responses describe the subject has similar advantages.


1. Which rating technique would be most suitable for each of these purposes?
   a. Obtaining ratings from principals to be used in deciding which teachers
      should receive salary increases for special merit.
   b. Obtaining information for school records regarding parents' impressions
      of their children's personalities.
   c. Maintaining weekly records of ward behavior of patients as seen by
      attendants.
   d. Recording teacher characteristics as judged by an observer in research
      evaluating teacher-training methods.
   e. Obtaining reports from supervisors of student teachers, to be used by
      campus instructors in helping the student to improve.
   f. Obtaining reports on pupils to be used in awarding college scholarships
      to the most deserving graduates in a state.
2. Why might keenly alive children have more behavior problems than those rated
   as wide-awake (Figure 86)? Would "keenly alive and alert" ordinarily be
   considered a sign of poor mental hygiene?
3. In the American Council rating scale, the trait scale for leadership (C) is
   defined by five specific phrases. What advantage does this scale have over a
   set of adjectives "excellent," "good," "average," "poor," "unsatisfactory"?
4. Why might integrity and kindness be especially hard to rate reliably?
5. Which of the following traits would probably be hardest to rate reliably after
   observations: skill in self-expression, freedom from tension, freedom from
   anxiety, leadership (Hollingworth, 1922, p. 32)?
6. Ratings on leadership made at Officer Candidate School correlated only .15
   with ratings on efficiency of combat leadership by superior officers who
   observed the men in combat (Jenkins, 1947). Why is the correlation so low?
7. Could a complex description be obtained by having the rater mark the MMPI
   responses that fit the subject? Would such a method be less satisfactory than
   the Q sort?
8. The rating form shown in Figure 88 is used by high schools to make reports to
   colleges. Compare this form with the ACE scale (Figure 85) with respect to
   a. format.
   b. traits covered.
   c. adequacy of phrasing of scale positions.

Validity of Ratings by Superiors. It is extremely difficult to state whether, in
a given situation, ratings by superiors will be valid measures of [...].
One might expect supervisors to rate job knowledge accurately. The trait
is well defined, the behavior is observable, and the supervisor has [...]
opportunity to observe. Nonetheless, supervisors' ratings of job knowledge
usually correlate only about .35 with the knowledge measured by a [...]
test, though ratings in one department reached a validity of .55 (Peters and
Campbell, 1955; Morsh and Schmid, 1956).

Another study investigated ratings given by department heads to foremen.
These ratings correlated only .22 with objective records of the work performance
of the crews. The rating supposedly reflected productivity but it actually
correlated .59 with how long the rater had known the foreman, and [...]
with his liking for the foreman (Stockford and Bissell, 1949). Such findings
are particularly distressing in view of the widespread use of ratings as criteria
for validating tests.

Although the evidence demands that one be suspicious of the validity of
ratings, they are sometimes excellent sources of data. Jack (1934) found that
ratings of "ascendance" by nursery-school teachers correlated .81 with a score
derived from objectively recorded observations of the child's acts on the
playground. For ratings to be depended upon, the validity of the rating
procedure should be established in the particular situation where it is used.


Peer Ratings 

In many situations ratings by peers give more useful information than
ratings by superiors. Even where ratings by superiors are available and
dependable, the peer ratings cover a different aspect of personality. A peer
is an individual who has the same status within the organization as the person
rated. Black's study in which girls rated others living in the same
dormitory is one example. Another is the rating of each other by officer
candidates. In military studies, such reports are often referred to as "buddy
ratings."

Whereas only one or two superiors know a subject well, ten to [...]
raters may give information when ratings in a class or a dormitory are collected.
As a consequence, the average rating on any trait is highly reliable. Indeed,
for well-defined traits in a group which has had reasonable opportunity to
become acquainted, composite peer ratings generally have reliabilities in
the neighborhood of .90.
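
The gain from averaging many raters can be illustrated with the Spearman-Brown formula, which estimates the reliability of a mean of k ratings from the reliability of a single rating. The text does not name this formula or report a single-rater reliability; the value of .35 used below is purely hypothetical, chosen only to show the size of the effect.

```python
def composite_reliability(single_rater_r, n_raters):
    """Spearman-Brown estimate of the reliability of the mean of n_raters ratings."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


# Hypothetical single-rater reliability of .35, averaged over 1, 2, and 15 peers.
for k in (1, 2, 15):
    print(k, round(composite_reliability(0.35, k), 2))
```

Under that assumed value, two raters give an estimate of about .52 and fifteen raters about .89, which shows how a composite of many modestly reliable peer judgments can approach .90.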


A child who impresses his peers as being a leader may not be the one
whom the teacher regards as a leader; the peers, for example, may place
great weight on popularity whereas the teacher notices originality and
initiative. It is of value for the teacher or counselor, however, to know which
persons are regarded as leaders by their own group. Indeed, the information
may be most significant in just those cases where the superiors and the peers
have different impressions. The peer rating is an objective statement about
the individual's reputation. Reputation is based to some extent on behavior,
but the social pattern and role relations in the group introduce biases of
various sorts. Among adolescents, correlations between reputations and careful
observations of corresponding behaviors range from .45 to .70 (Newman and
Jones, 1946).

To obtain peer ratings it is usually necessary to simplify the task. Raters
are untrained, and we desire each rater to describe many individuals. The
adjective checklist (see p. 477) can be marked much more quickly than the
descriptive graphic scale and can cover many aspects of behavior. In using
the checklist to obtain information about particular individuals, related
adjectives are classified into groups and a count is made of the frequency with
which adjectives in each category are checked. Such a checklist leads to a
descriptive profile.

Nomination Techniques. If thirty persons in a group rate each other on
twenty traits, each person is being asked to give 600 responses. This means
that considerable carelessness and halo effect may be expected, and various
devices are used to reduce the labor without reducing the [...]. The most
important of these devices is the nomination technique. Each member of the
group is asked to name a [...] particular respect, such as [...]. [...] most
lacking in leadership could also be solicited, but this arouses anxiety [...]
because they are being considered for such unfavorable mentions and, as raters,
they are reluctant to speak unfavorably of associates. The data gatherer can
usually infer that the person who is never mentioned for a favorable trait
belongs somewhere toward the other end of the scale.

For young children, Hartshorne and May disguised the nomination technique
as a guessing game. The "Guess Who" test describes various roles children
may play, and each member of the group names the children he thinks
each description fits. Typical descriptions are (Hartshorne and May, 1929,
[...]):

   This is the class athlete. He (or she) can play baseball, basketball, tennis,
   can [...] as well as any, and is a good sport.

   This one is always picking on others and annoying them.

A score for each child is made by counting the frequency with which he is
mentioned for each description.

Sociometric Ratings. The sociogram is a method of studying the social
structure of groups. Characteristics of an individual may be studied by the
Guess Who method, but the sociogram gives further insight by identifying
cliques, hierarchies of [...], and [...] groupings. The sociogram was developed
by Moreno [...]. Although the technique has been amended in various ways which
sacrifice [...] for convenience, the best procedure is to request members of a
group to indicate their choices for companions in a particular activity.
Gronlund (1959) suggests the following directions for use with pupils in the
upper elementary grades.

   During the next few weeks we will be changing our seats around, working in
   small groups, and playing some group games. Now that we all know each other
   by name, you can help me arrange groups that work and play best together.
   You can do this by writing the names of the children you would like to have
   sit near you, to have work with you, and to have play with you. You may
   choose anyone in this room you wish, including those pupils who are absent.
   Your choices will not be seen by anyone else. Give first name and initial of
   last name.

   Make your choices carefully so the groups will be the way you really want
   them. I will try to arrange the groups so that each pupil gets at least two
   of his choices. Sometimes it is hard to give everyone his first few choices
   so be sure to make all five choices for each question.

Directions should be concerned with real group activities, and the choices
should be real choices. The data are not obtained in a test setting; [...]
they are obtained as a means of dealing with the group. If data are obtained
from a less real question, such as "Who are your friends?" there is more
likelihood of answers given to make a good impression. Subjects must know that
their reports will be treated confidentially. The sociometric data should be
used as promised to set up work groups, committees, homeroom [...], or
whatever; this permits one to obtain cooperation when the technique is used
again at a later date.

Though sociometric ratings are easy to obtain in most situations, the tester
must be wary of arousing anxieties. In a group of adolescent girls where
popularity is a matter of great concern, a girl may resist the injunction to
indicate the one person she most prefers, or may worry about how she will be
rated.

After the choices are obtained, they are plotted in a sociogram. Figure 89
is the sociogram of a class of fourth-grade girls early in the school year.
Pupils indicated one to three choices and were permitted to list also any
classmates they would not choose. This sociogram shows several typical
configurations. There are two groups or cliques. In one Emily is the most-sought
person, with Jane, Lenora, Caroline, Rhoda, and Louise as accepted members.
In the other group, Agnes is the key figure, with Lurline, Patricia, and
Ann as members. Patricia is not thoroughly integrated with the group; while
accepted by Agnes, she is also reaching toward Emily in the other group,
rather than Lurline or Ann. Agnes, who might be a popular leader among the
girls, instead shows considerable hostility, rejecting three popular girls. Ella
is not chosen by any of the others, and Tess is even more isolated.
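
The tallies that underlie such a reading of a sociogram can be sketched in Python. The data below are invented for the illustration and are not the choices of the pupils in Figure 89; the function names are likewise hypothetical. The sketch counts choices received, finds mutual choices, and lists members whom nobody chose.

```python
from collections import Counter

# Hypothetical choices; these names are not the pupils in Figure 89.
CHOICES = {
    "Ada": ["Bea", "Cora"],
    "Bea": ["Ada"],
    "Cora": ["Ada", "Bea"],
    "Dot": ["Ada"],
    "Eve": [],
}


def choices_received(choices):
    """Count how many times each member is chosen by others."""
    counts = Counter({member: 0 for member in choices})
    for chosen in choices.values():
        for name in chosen:
            counts[name] += 1
    return counts


def mutual_pairs(choices):
    """Return the set of pairs who chose each other."""
    return {
        frozenset((a, b))
        for a, chosen in choices.items()
        for b in chosen
        if a in choices.get(b, [])
    }


def unchosen(choices):
    """Members nobody chose, that is, candidates for isolate status."""
    received = choices_received(choices)
    return sorted(name for name, n in received.items() if n == 0)
```

In this invented group Ada receives three choices, Ada is in a mutual pair with both Bea and Cora, and Dot and Eve are chosen by no one. Rejections could be tallied the same way from a second dictionary.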

The sociogram obtained depends upon the question asked. For example,
if sorority girls are asked to indicate their choices for roommates, and their
choices of persons with whom to study, the sociometric patterns will differ.
A best friend may be thought of as too noisy or untidy for a good roommate,
and an unpopular girl may be regarded as an excellent helper on school
assignments. Basic social configurations are fairly stable when different
questions are used, but one cannot assume that the interpersonal structure of a
group is the same under all conditions.

[Figure 89: sociogram. Legend: one line style shows that A and B choose each
other, a second that A chooses B, and a third that A rejects B.]

FIG. 89. Sociogram for a class of fourth-grade girls. (Adapted from Staff, Division on Child
Development, 1945, p. 297.)
The structure changes with time. By December, Agnes was a "star" in her
class, along with Rhoda and Emily. The cliques had disappeared, thanks to
the skill of the teacher. Ann and Lurline still chose each other, but Agnes
now turned her back on them, ignoring Ann and rejecting Lurline. Even
though social relationships change, an individual's level of popularity is
remarkably constant. Gronlund (1959) points out that among elementary pupils
the stability over a one-year interval is about as high as for intelligence
and achievement.


The term sociometric rating applies generally to all methods of identifying
social relationships among group members. No sharp distinction is to be
made between the descriptive peer rating and the sociometric rating. In
general the latter is restricted to questions about whom the rater likes
or would prefer to work with, and thus is as much concerned with [...]
reaction as with the rater's personality. When given willingly, sociometric
ratings may be much more dependable than other sorts of ratings. As Lindzey
and Borgatta point out (Lindzey, 1954, Vol. I, p. 406):


   There is no need to train raters to engage in sociometric [...]. The
   difficult and time-consuming task of attempting to produce [...]
   frames of reference and homogeneous criteria in terms of which [...]
   shall be assigned is avoided. The rater is asked to apply [...]
   particular, unique, and sometimes irrational criteria he has spent a
   lifetime developing. Everyone is an experienced or expert rater when it
   comes to sociometric judgments. Each of us has a vast body of experience
   in deciding with whom we wish to interact and whom we wish to
   avoid. Liking and disliking, accepting and rejecting are part of the
   process of daily living. . . . One might say that the individual who uses
   these techniques is taking advantage of the largest pool of sensitive and
   experienced raters that is anywhere available.

The validity of responses to the sociometric questionnaire is attested by the
finding that choices given by pupils as to preferred fellow actors in a
play correlated about .80 with actual choices when an opportunity to present
impromptu plays was given (Byrd, 1951).


9. What children besides Tess and Ella are fringers?
10. What interpretations of Agnes' hostility can be suggested?
11. Prior to this study, the teacher had characterized Tess as hard working,
    interested in accomplishing tasks, "fits in nicely with the group." Tess
    [...] others with their sewing, at which she is superior. How would the
    teacher's outlook and treatment of Tess be affected by the information from
    the sociogram?
12. The following choices were made in a group of tenth-graders. Plot a sociogram
    and discuss the interactions shown.

       Shirley chooses Charles, Jim, and Sam.
       Charles chooses Shirley, Sam, and Jim; rejects Tom.
       Phil chooses Jim, Charles, and Shirley; rejects Wallace and Tom.
       Wallace chooses Phil and Jack; rejects Tom.
       Jim chooses Jack, Sam, Charles, and Shirley; rejects Tom.
       Jack chooses Jim and Tom; rejects Phil.
       Shirley is chosen by several girls whom she does not mention. Sam and Tom
       were absent.

13. When sociograms were made of squadrons of Navy fliers on combat duty, it
    was found that the administratively designated leaders were often not the
    ones chosen as preferred work leaders by the men (G. A. Kelly, 1947, p. [...]).
    What practical suggestions follow from this finding?
14. In a group of sorority girls, the sociometric question "Whom would you choose
    as a roommate?" is asked; will results be the same if the question is changed
    to "Whom would you choose to go on a double date?"
15. What would be the best way of studying the reliability of sociometric data?


Uses of Peer Ratings. Information about the individual's reputation or status
[...] can be used in many ways. The group leader uses it to identify
individuals who require special attention and individuals who can be developed
into leaders. Sociometric information has frequently been of use in
reorganizing a group so that it will function better. For example, Roethlisberger
and Dickson used sociometric data to organize factory workers into congenial
teams. Moreno, in an institution for delinquent girls, assigned the girls
to [...] groups on the basis of sociometric choices.

Peer ratings can be used as a basis for selection and classification. Among
officer candidates, for example, the impression a man makes on his companions
during the early phases of training is likely to forecast his ability to win
confidence and acceptance as an officer. One study (Wherry and Fryer,
1949) refers to the peer rating as the "purest measure of leadership . . .
better than any other variable." Kelly and Fiske (1951, p. 169) found that
peer ratings of clinical psychology trainees after only a few days' close
association were significant predictors of ratings of clinical competence made by
university departments three years later. The median correlation of .25 is
only slightly below the coefficient of .34 for ratings by a team of trained
psychological assessors using full test and interview data. Neither validity
coefficient is high, partly because of the inadequacy of the criteria. Similarly,
composite peer ratings of officer candidates correlate about .50 with later
ratings by superiors in duty assignments. This correlation is extremely
impressive in view of the criterion reliability of about .50. The rated traits
which predict the criterion include cooperative, emotionally stable, assertive,
intellectual, and determined (Tupes, 1957).

[...] the peer descriptions point to characteristics of the individual
which impede his acceptance. Especially when the student seems to lack
insight regarding his reputation, the peer rating points to behavior which
should be examined during counseling.
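
Why a validity of about .50 is impressive when the criterion itself has a reliability of only about .50 can be shown with the standard correction for attenuation, which the text does not state explicitly. Dividing the observed correlation by the square root of the criterion reliability estimates what the correlation would be against an error-free criterion. The function name below is hypothetical.

```python
from math import sqrt


def corrected_for_criterion_unreliability(observed_r, criterion_reliability):
    """Estimate the validity against an error-free criterion (correction for attenuation)."""
    return observed_r / sqrt(criterion_reliability)


# Figures quoted in the text: validity about .50, criterion reliability about .50.
estimate = corrected_for_criterion_unreliability(0.50, 0.50)
print(round(estimate, 2))  # 0.71
```

On these figures the peer ratings would correlate roughly .71 with a perfectly dependable rating by superiors, so much of the shortfall from a perfect correlation lies in the criterion rather than in the predictor.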


16. Cattell and Stice (1953) find that surgency (i.e., energetic, talkative,
    enthusiastic behavior) correlates very little with leadership behavior as
    rated by observers but correlates substantially with frequency of election.
    Explain this finding. What does it imply regarding the use of peer ratings
    as criteria?


Noteworthy Rating Scales

Few rating scales are distributed commercially, since the common practice
is to develop a new instrument for each institution or each investigation.
Some scales have been carefully designed and standardized for common research
and administrative purposes. Figures 85 and 88 show scales for use by
high schools in making recommendations about college applicants. Several
scales for industrial merit rating have also been published or distributed
through management consulting firms. Here we shall examine selected scales
for two other uses.

The rating of personality has particularly widespread application. Both for
practical personnel decisions and for research, we wish to record the
impressions of peers and supervisors. A suitable set of scales will include traits
which can be rated reliably and which are relatively definite and free from
halo effect.

Scales for ward observations and clinical diagnosis emphasize symptoms.
In large hospitals, for example, it is useful for ward attendants to fill out such
forms periodically on each patient, since shifts in observed behavior may
imply that the patient's treatment should be altered. Several scales for rating
patient behavior have been developed, one being the Wittenborn Psychiatric
Rating Scales published by the Psychological Corporation (1955). (See
also Lorr et al., 1955.) The Wittenborn form presents 52 scales, organized
into nine scores representing different types of symptoms. Figure 90 shows
the ratings of a patient on the first five items. The clusters I to IX were
defined through factor analysis, and the unshaded block in the rating form
indicates that a particular response is relevant to one of the dimensions. The
rater need only circle the number (0, 1, 2, or 3) which indicates his
impression of the patient. The scorer then copies that number into the
corresponding block. For example, the rating given on scale 1, indicating
difficulty in sleeping, adds one point to the score in cluster I and cluster V.
Two unshaded blocks in a given column indicate that double weight is given the
item.

The dimensions, established by examining symptoms in a large number
of patients, are named in psychiatric terminology: I. Acute anxiety; II.
Conversion hysteria; III. Manic state; IV. Depressed state; V. Schizophrenic
excitement; VI. Paranoid condition; VII. Paranoid schizophrenic; VIII.
Hebephrenic schizophrenic; IX. Phobic compulsive. With few exceptions (e.g.,
Acute anxiety and Phobic compulsive) the correlations between scores are
low. The scores are moderately reliable, the median split-half correlation
being .82. Combining two or more independent ratings would be necessary to
get a dependable picture of the individual's condition. This is not a serious
limitation, since clinical decisions are likely to be based on trends over
several weeks. It should be noted that the rating scale indicates the patient's
condition rather than his diagnostic category. A patient may show several
different patterns of behavior as he progresses toward recovery or is
temporarily upset, and the function of the scale is to record such changes rather
than to give him a label.
As a further example of rating scale development, we may mention the
Fels Parent Behavior Rating Scales. These scales were developed to study
the preschool child's family. Trained observers visited each home periodically
and wrote a descriptive report; to provide systematic data which could
be treated statistically, the observer also gave ratings on thirty scales. The
scales were designed to cover emotional relations, disciplinary methods, and
values of the home.


The directions and definitions for one scale, for example, are as follows:

Quantity of Suggestion (Suggesting–Non-suggesting)

Rate the parent's tendency to make suggestions to the child. Is the parent
constantly offering requests, commands, hints, or other attempts to direct the child's
immediate behavior? Or does the parent withhold suggestions, giving the child's
initiative full sway?

This does not apply to routine regulations and their enforcement. Rate only
where there is opportunity for suggestion. Note that “suggestion” is used
broadly, including direct and indirect, positive and negative, verbal and nonverbal,
mandatory and optional.

– Parent continually attempting to direct the minute details of the child's
routine functioning, and “free” play as well.

– Occasionally withholds suggestions, but more often indicates what to do next
or how to do it.

– Parent's tendency to allow child's initiative full scope is about equal to
tendency to interfere by making suggestions.

– Makes general suggestions now and then, but allows child large measure of
freedom to do things own way.

– Parent not only consistently avoids volunteering suggestions, but tends to
withhold them when they are requested, or when they are the obvious reaction
to the immediate situation.

Such lengthy scales, requiring patient and thoughtful discrimination,
contrast markedly with the simple rating scales used in obtaining recommendations
on prospective employees or routine judgments from teachers. The
elaborate definitions and fine subdivision of traits permit a much more reliable
and comprehensive picture of the home than a simple form could.
Interrater reliability on single traits ranges from about .50 to .90. The scale
is designed to be used by a qualified professional observer who has a substantial
amount of information to record, information which could not be communicated
fully in a coarse scale. Such an elaborate scale is unnecessary when
the rater has only casual impressions to convey.

The scale is organized around these factors: Warmth (first five variables
in Figure 91), Adjustment, Restrictiveness, Clarity of Policy, and Interference.
The detailed picture of five areas which are themselves further differentiated
forestalls any tendency to characterize homes as simply “good” or
“bad.” The scale results show that democratic homes can be cold and
conflictful, orderly yet affectionate, or warm and still maladjusted. The profile
in Figure 91 illustrates the volume of information recorded in quantitative
form about a single home. The Stones are warm, protective, rather coercive,


[FIG. 91. Rating of the home treatment of Ted Stone (Baldwin et al., 1949, p. 28).
The profile plots the Stone home on a scale running from 20 to 80 for each
rating scale, every scale being labeled by its two poles (for example,
Child-subordinate to Child-centered, Rejection to Acceptance, Dictatorial to
Democratic, Vague policy to Clear policy).]


giving, as the authors say, a picture of “restrictive indulgence.” But more can
be gleaned from the profile (Baldwin et al., 1949, p. 29; see also Baldwin
et al., 1945):

The low rating of readiness of enforcement contributes a definite flavor to the
interpretation. The mother is restrictive, but lax in enforcement and also, we see,
mild in her punishments. The home begins to appear verbal and nagging, but
without any core of enforced disciplinary policy. When ratings of low adjustment,
high discord, low effectiveness of policy, and high disciplinary friction are added,
the suspicion arises that Ted does not conform to his mother's standards. She [...]
and nags, but achieves little. The fact that approval is still high in spite of all the
conflict and discord might be interpreted as a determined effort by the mother to
see the boy in the best possible light. . . . A low rating on understanding makes
it clear that the mother has little insight into what Ted wants and needs but
instead is projecting on him her own motivation.


This rating pattern fits the full clinical description given by the visitor. 
This evidence of validity raises the question whether careful and elaborate 
ratings cannot accomplish all that a clinical description might hope to do. An 
attempt to reduce individuality to a limited list of dimensions always loses 
some idiosyncratic features, however. The ratings characterize the Stone 
family in terms of those qualities on which all homes can be judged, but not 
in terms of its own recurrent themes and conditions. From the clinical note 
we learn facts such as these: Mrs. Stone has had lifelong trouble in forming 
emotional ties, with the significant exceptions of her mother and her son. She 
is contemptuous of her husband. She thinks that no one, not even her
husband, understands that she has “sacrificed her life for Ted.” “Having
identified herself completely with her product, it was necessary that the child
himself be immaculate, perfect in behavior, precocious intellectually.” Ted is
prone to respiratory infections and subject to allergies; these intensify his
mother's anxiety. Discipline is pulled in opposite directions by Mrs. Stone's
desire for perfection and her identification with Ted. “On one occasion when
he was sent to bed an hour early as a punishment, Mrs. Stone decided she
had been overly severe and went to the bedroom to read to him for the extra
period.” Such descriptive color and texture, while of no use for statistical
research, is informative both to the clinical worker and to the research
psychologist. There is no reason to think that even elaborate rating systems can
replace descriptive accounts in exploratory research or casework.


17. Would an elaborate rating instrument such as the Fels scales be advantageous
when obtaining ratings of a child by his teacher?

18. How much of the “individual” information quoted from the caseworker's notes
on the Stones could have been covered by adding additional traits to the rating
scale?

19. No evidence on agreement between raters is given in the manual for the
Wittenborn scales. Why is this information needed? Plan a study to obtain it.


OBSERVATION OF BEHAVIOR SAMPLES 


Self-reports and judgments by peers and supervisors are based on a more or
less haphazard composite of observations. The rater has not seen the
individual in all situations, and selective recall operates in both rating and
self-report. Systematic observations can give a more accurate description of
typical behavior. A distinction must be made between observations intended to
cover a representative sample of behavior and observations in a standardized
test situation. The former attempts to estimate typical behavior from a
statistically representative sample of situations actually occurring in life; the
situations therefore differ from person to person. The latter presents the same
situation for everyone. The situation may be quite unlike those in the subject's
life.

Field observations (i.e., observation of the subject under his normal
circumstances) are relatively easy to carry out and for some studies they are
more suitable than standardized observations. Many investigators feel that
it is impossible to know personality unless we watch the subject react to the
stimuli that are most significant for him. Different stimuli are significant
for different people. Standard situations are perhaps not as likely to elicit the
important behavior patterns as are the normal (dissimilar) conditions under
which the subjects live. The difficulty lies in seeing enough of the person's
normal behavior and in obtaining dependable records.

Sampling Problems


Whenever one wishes to know the typical behavior of an employee, a student,
or a patient, the most direct way to find out is to observe him in normal
situations. If he does not know that we are watching him, we obtain a
truthful picture limited only by our skill as observers and our persistence.
This is our usual basis for judging associates and friends. Judgments based
upon observation, however, are likely to be untrustworthy on account of
sampling errors and observer errors.

To know the “typical” behavior of an individual, it is necessary to know
how he characteristically acts in a particular situation. But situations change
from moment to moment. If we observe the attentiveness [...] different
impression from the one [...] cheerfulness or politeness when [...], our
impression may be unfair. The only way to be even moderately certain of
typical behavior is to study [...] expensive. In practice, one must compromise
between [...] and economy.

Inference about individual differences is difficult because one can never
observe two individuals in the same situation. Even when a situation is
externally constant, previous conditions cause people to behave differently.
If Jimmy fidgets more than John in the classroom, one is likely to infer
that Jimmy is “restless,” “nervous,” or “jumpy.” If the impression is confirmed
by repeated observations, this difference seems to be fundamental. But if
Jimmy usually comes to school without breakfast, if he expects to be criticized
by the teacher for poor work, or if he is large for the chairs provided,
the difference in activity may tell nothing about the boys' basic restlessness.
In fact, if conditions were reversed, Johnny might be more restless than
Jimmy is now. At best, comparative observations show how different people
act under their present conditions, but do not guarantee that the differences
would persist if background conditions changed.

One of the best approaches to precise comparison is time sampling. In
time sampling, a set schedule of observations is planned in advance. The
schedule is randomized so that each subject is seen under comparable
conditions. In one study of social contacts of preschool children, for example,
a schedule of one-minute observations was drawn up. After the observer
watched a child for one minute, noting all social interaction, he wrote
a full record. Children were watched in a predetermined order which was
altered from day to day. During the study each child was observed an equal
number of times during the first five minutes of the free-play hour, during
the second five minutes, and so on (Barker et al., 1943, pp. 509-525).

[FIG. 92. Record obtained in a five-minute observation of a preschool child
(Thomas [...]). The diagram shows the nursery school play yard and traces
Edward's movements. Letters mark the start and finish of each activity.
Legible entries in the record for Edward, 12/6/28: AB, plays with and mauls
Paul; C, teacher intervenes; D, goes up to Alma, throws cover; IJ, slide; KL, in
closet; knocks Paul down, teacher intervenes.]

The advantage of short well-distributed time samples is that the cumulative
picture is likely to be far more typical than an equal amount of evidence
obtained in a few longer observations. Moreover, errors of memory are re- 
duced, since the observer can make full notes during or just after observing. 
Time samples are especially suitable for recording specific facts that can be 
expressed numerically, such as the number of social contacts with other chil- 
dren. A slightly more elaborate record shows the complete activity pattern 
during the observation period. Figure 92 reports the behavior of Edward 
during a five-minute period: what he did, for how long, and where. 
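
A schedule in which the order of watching is altered from day to day, so that each child is seen equally often in each part of the play hour, can be drawn up mechanically. The sketch below assumes, hypothetically, that there are as many observation slots and as many observation days as children. It shuffles a starting order and then rotates it one position per day. The function name, the child called Ruth, and the use of simple rotation are assumptions for illustration, not the procedure of the study cited.

```python
import random


def rotating_schedule(children, seed=None):
    """Return one observation order per day.

    The starting order is shuffled, then rotated one position per day,
    so over len(children) days each child occupies each slot once.
    """
    rng = random.Random(seed)
    order = list(children)
    rng.shuffle(order)
    n = len(order)
    return [order[day:] + order[:day] for day in range(n)]


schedule = rotating_schedule(["Edward", "Paul", "Alma", "Ruth"], seed=1)
for day, order in enumerate(schedule, start=1):
    print(f"Day {day}: {order}")
```

Rotation guarantees the balance across slots but is only one way of obtaining it; an investigator who also wants the sequence of neighbors to vary could draw a fresh balanced arrangement instead.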

A much more extensive sample is obtained in the “day record” technique
of Barker and his associates. Their general aim is to study the pattern of a
child's life, with particular attention to the various settings in which he
moves. For example, they wish to see what behaviors are evoked from small-town
children in the course of a day, as compared to those evoked from children
in larger communities. For this purpose, an observer goes with the child
throughout his whole day's program, from the moment of awakening until
the end of the day. The form of the record is illustrated by this description
of three boys playing with a crate in a vacant lot (Barker and Wright, 1951,
pp. 349-350):

5:39. Raymond tilted the crate from side to side in a calm, rhythmical way.
Clifford's feet were endangered again. Stewart came over and very
protectively led Clifford out of the way. [Observer's opinion.]
Raymond slowly descended to the ground inside the crate.
When Stewart came back around the crate, Raymond reached out at him,
and growled very gutturally, and said, “I'm a big gorilla.” Growling very
ferociously, he stamped around the “cage” with his arms hanging loosely.
He reached out with slow, gross movements.
Raymond reached toward Clifford but didn't really try to catch him.
Then he grabbed Stewart by the shirt.
Imitating a very fierce gorilla, he pulled Stewart toward the crate.
Stewart was passive and allowed himself to be pulled in. He said, “Why
don't you let go of me?” He spoke disgustedly and yet not disparagingly.
Raymond released his grasp and ceased imitating a gorilla.
He tilted the crate so that he could crawl out of the open end. As he
crawled out, he lost control of the crate and it fell over on its side with
the open end perpendicular to the ground.
Stewart said, “Well, how did you get out?”
Raymond said self-consciously, “I fell out,” with a laugh.
He looked briefly at me as if wondering what I thought.
5:40. [...] crawled inside and went directly through the crate
and out the [...].
Stewart and Clifford got in front of Raymond and tried to get him to chase
them and continue imitating a gorilla.
Raymond stood immobile and didn't cooperate.
Finally Stewart said to Clifford, “Maybe if he'll follow us through, we
can crawl out this end. Then we can tip it up and have him caught again.”


The day record has some advantage in showing the total sequence of
activities, but it also has disadvantages. The unconcealed observer may have
some effect upon behavior, an effect which cannot be assessed. And the record
of a single full day does not obtain completely typical information on the
child in question, since each day has its own unique characteristics. Neither
of these is a serious drawback for the Barker studies, where the goal is an
overall report of the normal experience of a group of children. Observing
many children, each on a different day, irons out sampling error so far as
group data are concerned. The observer is present in all the data and therefore
does not prevent comparisons among groups.

A series of time samples gives at best a statistical composite of different
responses. Responses which the observer counts as the same may actually
have quite different meanings, and situations which appear similar to the observer
may evoke quite different responses. Newcomb (1929) observed boys in a
summer camp, making daily records of many particular responses such as
coöperation in after-meal work, fighting with other boys, and persistence [...].
When these day-to-day records were studied, most boys were found to be much
alike. Behaviors grouped within one of these [...]
traits correlated little higher than obviously dissimilar behaviors. Another
study found only trivial correlations (median .20) between punctuality
observed in different situations (Dudycha, 1936).

Conclusions formed in one situation, even on the basis of many cumulated
observations, are valid only for that situation. Inference as to how a
person would act in another situation is warranted only when behaviors under
the two conditions have been shown to be correlated, or when observation
yields so much understanding of his underlying personality structure
(his stimulus equivalences) that we can see what a new situation means to
him.

Symonds (1931, p. 5) has commented emphatically on the need for adequate
sampling:

A single observation is unreliable, a single rating is unreliable, a single
test is unreliable, a single measurement is unreliable, a single answer to
a question is unreliable. Reliability is achieved by heaping up observations,
ratings, tests, questions, measures. . . . If you ask one teacher for
her judgment of a boy's trustworthiness, you obtain what she has been
able to observe in those few narrow classroom situations that appeared
when her attention was particularly directed to some act involving honesty.
An adequate rating, on the other hand, requires the judgment of
several raters in several situations at several different times. Reliable
evidence is multiplied evidence.
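
Symonds' maxim can be given a numerical form with the Spearman-Brown formula, which the passage itself does not invoke. The sketch below assumes that the k observations or ratings being pooled are parallel, that is, equally reliable and equally intercorrelated. The single-rating values of .50 and .00 are illustrative choices, and the function name is hypothetical.

```python
def pooled_reliability(r_single, k):
    """Spearman-Brown estimate of the reliability of k pooled parallel measures.

    r_single: reliability (intercorrelation) of one measure
    k:        number of measures summed or averaged
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1 + (k - 1) * r_single)


for r in (0.50, 0.00):
    row = [round(pooled_reliability(r, k), 2) for k in (1, 2, 4, 8)]
    print(r, row)
```

With a single-rating reliability of .50, four pooled ratings reach an estimated .80. When separate observations do not correlate at all, the estimate stays at zero however many are pooled, so multiplying evidence helps only where the pieces of evidence agree to some degree.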
The extreme variation in performance is illustrated by a study of navigators.
Students were taken on missions where their task was to continually
compute their own position, air speed, etc., by dead reckoning. On each
mission, four separate legs were run, and the accuracy of the man's air speed
report for the leg was recorded. The score for each mission had a split-half
reliability of .77; this is an indication of the man's consistency from one leg to
the next, under the same wind conditions, with the same plane, etc. A correlation
was also computed between scores on different days. While this correlation
varied from class to class, the mean reliability coefficient was .00 (Carter
and Dudek, 1947). Differences in score are determined almost entirely
by transient conditions rather than by the individual's ability. Under these
circumstances, even combining information from several missions would not
give a useful report on the individual.

How many observations are required to obtain reliable data depends on
the problem. The experimenter can estimate reliability of sampling by correlating
ratings of “odd” with “even” observations. By this means, it was determined
that 24 or more five-minute time samples permitted reasonably stable
estimates of individual differences in preschool children (Arrington,
[...]). In general, many short observations are superior to a few longer samples
of behavior.
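
The odd-even check can be sketched as follows, assuming that each child has a sequence of numerical scores, one per time sample (for instance, the number of social contacts counted in each sample). The data are invented, the function names are hypothetical, and the Pearson correlation is computed directly so that no outside library is needed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_correlation(samples_by_child):
    """Correlate each child's total on odd-numbered samples with the even total."""
    odd = [sum(s[0::2]) for s in samples_by_child]   # samples 1, 3, 5, ...
    even = [sum(s[1::2]) for s in samples_by_child]  # samples 2, 4, 6, ...
    return pearson(odd, even)


# Invented counts of social contacts in six time samples for five children.
data = [
    [3, 4, 2, 5, 3, 4],
    [0, 1, 1, 0, 2, 1],
    [6, 5, 7, 6, 5, 7],
    [2, 2, 3, 1, 2, 3],
    [4, 3, 5, 4, 4, 5],
]
print(round(odd_even_correlation(data), 2))
```

A high coefficient indicates that the two halves of the record rank the children in much the same way. A low one suggests that more samples are needed before individual differences can be trusted.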
20. Why might an unfair picture of a child's behavior be obtained if he were always
observed during the first five minutes of the play period and never during [...]?

21. [...] Edward's personality could be obtained from [...]

22. [...] behavior are discarded in making an objective record such as Figure 92?

23. [...] a criterion [...] selection research, [...] would it be better to test every
flier with [...] repeated [...] the same day or with a similar number of landings
spread over several days?

24. [...] one say, paraphrasing Symonds, “Valid evidence is multiplied evidence”?


Observer Error

Whenever a person observes an event, he notices some happenings and
ignores others. This is a necessary difficulty, since any activity has too many
aspects for the mind to attend to all at once. Especially in social situations,
the complexity of interaction prevents exhaustive reporting. If errors in
observing were merely random omissions, they would be unimportant. But
observers make systematic errors, overemphasizing some types of happenings
and failing to report others.

Viewing the identical scene, observers give widely different reports. The
following reports were written by four observers, each of whom saw the
same motion-picture scene of about ten minutes' duration (from the film
This Is Robert). The film was shown twice without sound. The film
sequence, taken in the classroom and on the playground, showed several
activities which revealed much of Robert's personality. The observers were
directed to note everything they could about one boy, Robert, and were told
to use parentheses to set apart inferences or interpretations. Numbers in
these accounts, referring to scenes in the film, have been inserted to aid
comparison.


Observer A: (2) Robert reads word by word, using finger to follow place.
(4) Observes girl in box with much preoccupation. (5) During singing, he in
general doesn't participate too actively. Interest is part of time centered elsewhere.
Appears to respond most actively to sections of song involving action. Has tendency
for seemingly meaningless movement. Twitching of fingers, aimless thrusts
with arms.

Observer B: (1) Looked at camera upon entering (seemed perplexed and
interested). Smiled at camera. (2) Reads (with apparent interest and with a [...]
degree of facility). (3) Active in roughhouse play with girls. (4) Upon being
kicked (unintentionally) by one girl he responded (angrily). (5) Talked with
girl sitting next to him between singing periods. Participated in singing. ([...]
appeared enthusiastic.) Didn't always sing with others. (6) Participated in a dispute
in a game with others (appeared to stand up for his own rights). Aggressive
behavior toward another boy. Turned pockets inside out while talking to [...]
and other students. (7) Put on overshoes without assistance. Climbed to top of
ladder rungs. Tried to get rung which was occupied by a girl but since she didn't
give in, contented himself with another place.

Observer C: (1) Smiles into camera (curious). When group breaks up, he
makes nervous gestures, throws arm out into air. (2) Attention to reading [...].
Reads with serious look on his face, has to use line marker. (3) Chases girls, [...].
(4) Girl kicks when he puts hand on her leg. Robert makes face at her. (5) Singing.
Sits with mouth open, knocks knees together, scratches leg, puts fingers in
mouth (seems to have several nervous habits, though not emotionally [...]
or self-conscious). (6) In a dispute over parchesi, he stands up for his rights.
(7) Short dispute because he wants rung on jungle gym.

Observer D: (2) Uses guide to follow words, reads slowly, fairly forced and
with careful formation of sounds (perhaps unsure of self and fearful of mistakes).
(3) Perhaps slightly aggressive as evidenced by pushing younger child aside
when moving from a position to another. Plays with other children with obvious
enjoyment, smiles, runs, seems especially associated with girls. This is noticeable
in games and in seating in singing. (5) Takes little interest in singing, [...]
hands and legs (perhaps shy and nervous). Seems in song to be unfamiliar
with words of main part, and shows disinterest by fidgeting and twisting around.
Not until chorus is reached does he pick up interest. His especial friend seems to
be a particular girl, as he is always seated by her.


Every observer is more sensitive to some types of behavior than others.
How does he regard nailbiting, failure to look one in the eye, or profanity?
If he considers these significant, he will note them and base his impression
on them. In the same situation, another observer might give greatest attention
to voice modulation, careful use of grammar, or friendliness of conversation.
Ideally, an observer would base his impression on every revealing act,
but when he is looking for one thing, he necessarily overlooks something
else.

Observers interpret what they see. If observers recorded only objective
facts, others studying the data might reach quite different interpretations,
but people always try to give meanings to what they see. When they make
an interpretation, they tend to overlook facts which do not fit the interpretation,
and may even invent facts needed to complete the event as interpreted.


25. What do you think really happened in scene (4)? Which observer came closest
to adequate reporting of it?

26. Which of the numbered scenes appears to give the most significant information
about Robert? How many of the observers reported that information?

27. Did the observers of the film about Robert succeed in identifying and marking
all their judgments and hypotheses?

28. Do the observers of the film about Robert ever disagree, or are the differences
entirely due to omissions and oversights?

29. A clinical psychologist asks a parent how well his 6-year-old child gets along
with other children. Illustrate how each of the following errors might operate:
a. The observer has not observed an adequate sample for judging typical behavior.
b. The observer notices events which fit his preconceived notions.
c. The observer is likely to note the behaviors he considers significant and to
ignore others of equal importance.
d. The observer may give a faulty interpretation to an event.


Systematic Recording. Where possible, it is desirable to record countable
units of behavior. For example, the extent to which factory workers attend
to their work may be described by a time record which notes the exact moments
when they are at work, and the time spent in looking around, obtaining
tools, and visiting. The causes of distractions can also be noted. Such
records for different workers and departments can be analyzed both for
judging the workers and for planning rest periods or improved tool distribution.

Child development has been studied through records of social contacts,
play activities, speech, and other objectively defined behaviors (Barker et
al., 1943, pp. 509 ff.; Thomas, 1929). Such precise reports are especially useful in
measuring changes, since the observer's memory cannot compare performance
now with performance several months ago.

Even if the behavior observed is too varied for direct tabulation, it may be
possible to define categories of actions so that the observer needs only to
check each incident as it occurs. One of the best examples is Bales' method
(1951) of categorizing social interaction for the purposes of research on
small groups. Twelve categories describing various types of response are
defined, including:

Shows solidarity, raises other's status, gives help, reward.
Shows tension release, jokes, laughs, shows satisfaction.
Asks for orientation, information, repetition, confirmation.
Disagrees, shows passive rejection, formality, withholds help.

The observer tallies responses moment by moment. By noting who makes
each remark and its approximate time, he can keep a full record of the interplay
of thought and emotion. An “interaction recorder” using a motor-driven
tape has been designed to facilitate such recording. Later analysis can
examine individual differences [...] the emergence of conflict and other
group processes.
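
Tallying of this kind reduces to counting responses by speaker and category. The following is a minimal sketch, assuming each tally is entered as a hypothetical tuple of time in seconds, speaker, and category. The four labels abbreviate the categories quoted above (the full system has twelve), and the speakers and events are invented.

```python
from collections import Counter

# Abbreviated labels for the four categories quoted in the text.
CATEGORIES = (
    "shows solidarity",
    "shows tension release",
    "asks for orientation",
    "disagrees",
)


def tally(events):
    """Count responses per (speaker, category) from (time, speaker, category) tuples."""
    counts = Counter()
    for _time, speaker, category in events:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        counts[(speaker, category)] += 1
    return counts


events = [
    (5, "A", "asks for orientation"),
    (12, "B", "disagrees"),
    (20, "A", "shows tension release"),
    (31, "B", "disagrees"),
]
counts = tally(events)
print(counts[("B", "disagrees")])  # number of disagreements tallied for speaker B
```

Because the time of each entry is kept in the raw list, the same record can later be cut into successive periods to follow how the pattern of responses shifts during a session.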


30. What advantages and disadvantages would a checklist or schedule have for
each of the following purposes, compared to a one-paragraph descriptive
report?
a. A social agency wishes its visitor to report the condition of homes of its
clients, including furnishings, conveniences, and neatness.
b. A department store sends shoppers to be served by its clerks and to observe
their procedure and manner.
c. A state requires an observation of the applicant's driving before issuing a
license to drive.

31. An investigator wishes to measure punctuality, for research purposes. He stations
himself where he can observe the arrival of each student attending a
particular class. Number of minutes early or late is recorded for each person.
Records are made on several days. What assumptions are involved in using
the average of these records as an index of punctuality?

32. Tape recordings of group discussions are used to study individual differences
in dominance, leadership, and other traits. What types of observable information
about personality could not be obtained from the tape?


Anecdotal Records. Although objective counts and tabulations are well
suited to research, their information is of limited value for guidance.
Anecdotal records escape the bleakness of quantitative records, offering
a more lifelike sketch of the subject. The observer is free to note any
behavior that appears significant, rather than having to concentrate on the
same traits for all subjects. Often the anecdotes are reports of incidents noted
by a teacher or supervisor in daily contacts.

In an anecdotal record, the observer describes exactly what he observed,
keeping interpretation and fact separate. The record is made as soon after
the action as possible, to eliminate errors of recall. Cumulated over a period
of time, the incidents provide a richer picture of behavior than any other
equally simple technique. The following are typical anecdotal reports:

Paul, after projecting the film for the class, took it back to the office (where I
happened to be) to rewind. He is not very skilled, and missed his timing, so that
much of the film cascaded onto the floor instead of going onto the takeup reel.
John came up just then and said something sarcastic about Paul's clumsiness. Paul
made no answer, but kept on at work with no change of manner and a stolid face.
Richard, who had been watching Paul, turned on John, told him to “shut up and
give Paul a chance,” and muttered something about “some of these kids make me
[...].” (Paul seems to suppress emotion; he certainly heard John's very unpleasant
tone.)

Joan spent the entire science period wandering from group to group instead of
helping Rose as she was expected to. She interrupted many of the others, telling
them they were doing the work wrong. She asked a lot of (foolish) questions
(“Does filter paper make certain things go through or just keep certain things
out?”) and was teased a good deal by the boys. By the time Rose was finished she
returned; Rose was quite angry, but they made up and Joan helped put things
away. But on her first trip to the storeroom she stayed to plate a gold ring with
mercury, while Rose made repeated trips with the equipment.


wrong. She 


The reporter has two responsibilities: he must select incidents worth reporting, and he must be objective. Both incidents characteristic of the person and striking exceptions to his normal conduct are helpful. The typical incidents provide a more individualized picture than the hackneyed trait names that would otherwise be used: friendly, showing initiative, rude, and so on. Exceptional actions are rarely reported in ratings and general impressions, but they too are significant. A single incident showing interest in the company's welfare from a man known as a troublemaker, or a sign of enthusiasm for learning on the part of a boy who rebels against school, may be the key to a new and successful treatment. The observer must weed out value judgments and interpretations, attempting to report the exact occurrences together with the pertinent events and environmental circumstances. One can never report "everything" about the incident. The reporter selects for his report the facts he considers relevant.

A few anecdotes tell little. As anecdotes accumulate, however, they begin to fill in a picture of the person's habits. If a particular response is typical, it will recur. An effective method of determining personality characteristics of an individual is to search through the anecdotes about him to detect repetitions. A summary based on these recurring patterns usually requires confirmation by further directed observation.


Suggested Readings 


Biber, Barbara E., & others. Recording spontaneous behaviors. Life and ways of the seven-year-old. New York: Basic Books, 1952. Pp. 33-53.
An account of procedures used in ten-minute schoolroom observations, together with illustrative anecdotal records and evidence of observer reliability.

Gronlund, Norman E. Validity of sociometric results. Sociometry in the classroom. New York: Harper, 1959. Pp. 158-188.
A review of studies shows how sociometric choices of school children relate to observed behavior, teacher opinions, and adjustment.

Lindzey, Gardner, & Borgatta, Edgar F. Sociometric measurement. In Lindzey (ed.), Handbook of social psychology, Vol. 1. Cambridge: Addison-Wesley, 1954. Pp. 405-448.
A comprehensive summary of the major sociometric techniques includes relations between sociometric evidence and other measures of personality. The authors draw particular attention to limitations of research or practical decisions based on sociometric findings alone.

Newman, Frances B. The development of methods in the adolescent growth study. In Frances B. Newman and Harold E. Jones, The adolescent in social groups. Appl. Psychol. Monogr., 1946, No. 9, 16-29.
This describes different techniques, ranging from quantitative ratings to narrative accounts, used in the same program for observing adolescent personality. Special advantages of each approach are indicated. Subsequent chapters give information on reliability and validity, and the use of the data in case analysis.

Prescott, Daniel A. Interpreting behavior. The child in the educative process. New York: McGraw-Hill, 1957. Pp. 99-150.
Anecdotal records collected on one boy throughout a school year are compared to show consistencies and deviations from his normal pattern. The discussion shows how teachers form and test hypotheses when using such records as a case-study technique. Several other chapters in the book also give useful information on the collection of anecdotal information.

Tuddenham, Read D. Studies in reputation: II. The diagnosis of social adjustment. Psychol. Monogr., 1952, 66, No. 1.
Reports on school children obtained by the nomination technique can be used for personality analysis. Five illustrative records are interpreted.


18 


Performance Tests of Personality 


THE preceding chapters have considered the well-established and unquestionably useful techniques for studying personality: interest measures for use in counseling, adjustment inventories for screening purposes, sociometric and peer ratings, empirically keyed predictive questionnaires, and systematic sampling of behavior. Some of these are of value to the practicing personnel psychologist and others are better suited to gathering research data, but each of them is capable of giving reliable data, the major sources of error in interpretation have been identified, and score interpretations are adequately supported by combined evidence and theory.

We now turn to procedures whose value is unsettled; indeed, it is vehemently disputed. Although performance tests and projective techniques have been in use for about thirty years, they have reached a much less mature stage of development than methods discussed to this point. The complexity of personality and the instability of personality theory are one source of difficulty. When there is no consensus as to the most important traits to measure, or even, as Allport says (Lindzey, 1958), on whether it is fruitful to think of personality in terms of traits, test developers have no target on which to concentrate. Conversely, when there are no outstanding tests in an area, research is scattered so widely that no coherent body of evidence becomes available as a base on which to build theory. When the theory of intelligence was confused and primitive in the first quarter of this century, the focus which the Binet scale gave to research effort made it possible to move toward a much clearer theory regarding the nature and growth of ability.

Since there are no performance tests of salient importance and few clear conclusions, we must confine ourselves in this chapter to describing enough illustrative tests to show the range of approaches. In addition to a variety of performance tests, this chapter describes projective tests, there being no sharp distinction between the two classes. We shall present some evidence on the validity of scores as psychometric predictors, that is, on the use of performance and projective tests as quantitative measuring instruments. Many of these instruments, however, are used primarily for impressionistic assessment in which the scores become raw material to be combined with other data into a portrait of the whole person. The issues regarding such assessment we reserve for a later chapter.

The aim of the performance test may be clarified by comparison with the time-sampling method of determining typical behavior. The limitations of time sampling are its high cost and the fact that the results, when obtained, depend upon both the subject and the situation in which he happens to have been observed. It has been the great hope of personality measurement to invent procedures which would give quantitative results, would directly represent behavior rather than biased reports, and would permit direct comparison of individuals in the same situation.

A performance test is an observation in a standard situation designed to elicit a particular type of response. One trait of great importance, for example, is how the person controls and expresses aggression. It is difficult to judge this by observing the daily life of most subjects, because the subject is only occasionally in an aggression-provoking situation. Testers have therefore developed standardized procedures for annoying the subject, such as contradicting the opinions he voices in a standard interview. This almost certainly provokes a reaction.

Psychological testing has been compared to the geologist's "sinking shafts" to obtain samples of significant material. Field observation gives, for the most part, strictly surface impressions, and the time sample sinks its shafts entirely at random; the performance test is designed to provoke exhibitions of truly critical behavior. The usual features of such a test are as follows:


• The stimulus situation is made as nearly uniform as possible for all subjects.

• The situation is designed to permit variation in those types of behavior which the tester wishes to observe.


An example is a test intended to show whether men entering flight training can work under stress. It requires operation of an apparatus in response to changing stimuli. The apparatus has a set of controls (levers), which the examinee resets continually as signal lights change and buzzers sound. The time required to react to each signal is recorded electrically. The examinee is told that he will be observed by a concealed observer "just as a checkpilot will rate you in flying." Administration is standardized: one minute of rest and anticipation, one minute of directions regarding signals and controls, and three short test periods. In each period, the examinee is given increasingly reproving "stress directions" while he is busily moving levers. The signals are made more complex. In test period C, the pattern of lights changes six times, after intervals of about fifteen seconds, while the examiner is delivering the following speech in an urgent voice: "Don't make lights flicker on and off. Be steady. . . . Quit making errors. . . . You aren't moving fast enough. . . . More speed. . . . Hurry and stop the clock. . . . Last chance. . . . Set controls quickly. . . . You are still making errors." The concealed observer, meanwhile, makes extensive ratings of manner and reaction to criticism, and objective clock scores are recorded (Guilford, 1947, pp. 660-664; Melton, 1947, pp. 811-814).

The greatest advantage of a test observation is that it reveals characteristics which appear only infrequently in normal activities, characteristics such as bravery, reaction to frustration, and dishonesty. Second, desire to make a good impression does not invalidate the test. In fact, just because he is anxious to make a good impression, the subject reveals more than he normally would. It is necessary, however, to take this motivation into account in interpreting results. The third advantage of the performance test is that it comes closer than other techniques to comparing subjects under identical conditions.
Performance tests vary greatly in purpose and in design. They may be psychometric instruments measuring single narrowly defined constructs such as persistence in routine work, or they may be a basis for the evaluation of the person's total life-style. They may be work samples predicting success in a specific assignment, or cross sections of behavior without reference to any single future situation.

Situations used to elicit performance range from highly structured to almost totally unstructured. A situation is structured if it has for all subjects a definite meaning. An unstructured situation presents so few cues or has so little pattern that the subject can give it almost any meaning he wishes. A common unstructured stimulus is the strange sound in the night. Is it a burglar? the cat? water dripping? The interpretation we make is strongly influenced by our interests, by fears conscious and unconscious, and, of course, by knowledge. In a structured situation, the subject knows exactly what he is expected to do and how he is expected to do it. In the unstructured situation, he guides himself. The more ambiguous the situation, the more opportunity there is for individual method of interpretation and performance. An extremely unstructured situation is established in Waehner's (1946) procedure for studying personality. She observed her subject's behavior and products after turning him loose in a studio equipped with all types of art media and materials, with little more instruction than "You may do anything you like with these."

Highly structured tasks are excellent for measuring ability just because they force everyone to try the same thing. Projective situations, at the opposite extreme, use almost totally unstructured stimuli. The projective test is so named because it permits the subject to project into the situation his unconscious thoughts, wishes, and fears (L. K. Frank, 1939). Thus the householder who interprets the creak in the dark as a burglar may be revealing that he is more anxious than another man, who interprets the same stimulus as a natural phenomenon and goes back to sleep.


1. In each of the following situations, discuss whether it would be preferable to employ observations in natural conditions, or standardized observations where conditions are fixed in advance and identical for all subjects.
a. The telephone company wishes to rate its operators on courtesy and clarity of speech. It is able to tap conversations and make recordings.
b. It is desired to screen Navy personnel for tendency to panic under conditions of extreme noise, as in amphibious landings.
c. An investigator wishes to study the habitual recklessness of 7-year-old boys in climbing and jumping.

2. To what extent may each of the following be considered an unstructured stimulus?
a. A teacher, during a test, glances up from her desk and barely sees the hasty movement of one boy who is pulling his hand into his lap from the aisle.
b. A group of people play duplicate bridge, the same set of hands being played at each table.
R e, edu- 
gned to obtain information about ade po ‘plan 
answers are anticipated and presented on 


Comprehension test, 
c. A test of addition which presents in random order the combinations up to 9 + 9, the pupil being directed to do as many items as he can in the time allowed.
d. In the Porteus test (Figure 1), the subject is to solve a maze. The time it takes him to trace the correct path with his pencil is scored.


STRUCTURED TESTS MEASURING SINGLE TRAITS 
Character and Persistence
The Character Education Inquiry of Hartshorne and May (1928, 1929, 1930) was the only extended effort to evaluate personality by quantitative and objective methods. Character traits can be validly assessed only by exposing people to temptation in a situation where they believe they can misbehave without detection. Traits studied by Hartshorne and May included honesty with money, persistence, and coöperativeness.

Honesty with money was tested by presenting arithmetic problems in which each pupil had to use a boxful of coins. The box provided for each pupil was secretly identified. At the end of the work each pupil carried his own box to a pile in front of the room. Since pupils were unaware that boxes could be identified, many took advantage of the opportunity to keep some of the money. Honesty in a situation involving prestige was tested by asking the child to do an impossible task, such as placing marks in small circles while keeping his eyes closed (Figure 93). Many children turned in "successful performances" which could have been obtained only by peeking.

[Figure 93 shows the test sheet, headed CIRCLES PUZZLE, First Trial, with the printed directions: "Wait for the signal for each trial. Put the point of your pencil on the cross at the foot of the oval. Then when the signal is given shut your eyes and put a small cross or X in each circle, taking them in order." A mark shows where to hold the page with the finger tips.]

FIG. 93. An "improbable achievement" test of honesty (Hartshorne and May, 1928, p. 62).

Motivation to work has been of particular interest because it is thought of as the link between aptitude and achievement. The employer, the teacher, the clinician, and all other users of tests wish they could predict whether a person's behavior will bear out the promise shown in tests of ability. Investigators have explored possible performance tests. Hartshorne and May tested how long children persist as a task becomes difficult. The children are given a story which builds to a climax: "Again the terrible piercing shriek of the whistle screamed at them. Charles could see the frightened face of the engineer. . . ." Here the examiner tells them that if they wish to learn the ending they must read the difficult printed material that follows:


CHARLESLIFTEDLUCILLETOHISBACK“PUTYOURARMSTICHT 
AROUNDMYNECKANDHOLDON 


NoWhoWTogETBaCkoN thETREStle.HoWTOBRingTHaTTErrIFIED 
BURDeNOFACHiLDuPtO 


: Tfee 
SN ALly tAp-taPC AME ARHYTH Month e BriD GeruNNing fee 
TcomING 


The pupil separates each word with a vertical mark as he deciphers it; the amount deciphered is an index of persistent effort (Hartshorne and May, 1928, p. 292). Some other tests have determined how long the subject will continue to work on an exceedingly difficult or impossible problem, presented in a series with solvable problems.


nclu- 


f 
1 " ang ° 
5. Joe is a known delinquent, having gotten into trouble together with a 9 


: fact 
r thefts and disturbances. How can you apiela a 
honesty, coöperation, and april puP 
6. The test illustrated in Figure 94 is a test of self-control, requiring the pupil to work in spite of attractive distractions. Does this measure the same trait as the story-completion test of persistence?


Perceptual and Cognitive Styles im- 
The psychologist observing any problem-solving behavior is quickly impressed by individual differences in the way subjects attack problems and surmount difficulties. Individual tests such as the Wechsler are commonly used as opportunities to observe such styles or habits (see p. 191). Information of this type might predict what problems a person could best deal with and what errors he would make, and might open the way to remedial training to improve his thinking.

FIG. 94. A test of self-control or resistance to distraction. (Hartshorne and May, 1929, p. 308. Reproduced by permission.)

Tests of cognitive processes are concerned primarily with how the person handles information. This interest grew out of Gestalt psychology and the work of Gestaltist investigators. Gestalt psychology has been concerned with the brain as an information-processing organ. If the brain is an information-processing system, simple perceptual tests should reveal its constant characteristics. These tests are analogous to the test made by an electronic technician when he puts a perfect sine-wave signal into an amplifier and examines the distorted signal put out by the speaker. This gives the "signature" of that particular system.

Several simple tests have been used to test the functioning of the perceptual system. Flicker fusion and apparent movement may be taken as examples. When a person views a light through a rotating shutter, "flicker" is perceived at low rates of rotation. At high rates, however, no interruption or flicker is noticed. As we increase the rate of rotation, the subject can report the point where the flicker just disappears. This "threshold" is fairly stable, and there are great individual differences. It has been suggested that this fusion point provides an index of the capacity of the nervous system to register details of incoming stimulation. Measures of flicker fusion have been found useful in diagnosis of brain damage. Halstead (1951) comments that fusion "represents a dramatic change in consciousness for the subject. For once he reaches the rate at which the flashes . . . fuse for him, he cannot tell the unsteady light from a steady one. He has broken with physical reality. The rate is much higher in our normal individuals than in our frontal brain-injured patients. It is as if the engine were running in the brain-injured, but running on inadequate power. It fails at the first little hill. . . . It seems clear that the test reflects an important aspect of cerebral metabolism."
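The threshold procedure for flicker fusion can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the function name `fusion_threshold`, the record format of `(rate_hz, sees_flicker)` pairs, and the sample rates are hypothetical and are not taken from Halstead's procedure.

```python
def fusion_threshold(trials):
    """Return the lowest flash rate at which flicker is no longer reported.

    `trials` is a list of (rate_hz, sees_flicker) pairs from one ascending
    series. The record format is an assumption made for this sketch.
    """
    for rate_hz, sees_flicker in sorted(trials):
        if not sees_flicker:
            return rate_hz
    return None  # flicker was reported at every rate presented


# Hypothetical ascending series for one subject.
series = [(20, True), (25, True), (30, True), (35, False), (40, False)]
print(fusion_threshold(series))
```

In practice an investigator would probably average several ascending and descending series, since a single series gives only one estimate of the fusion point.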


When two lights, side by side, are flashed on and off in quick succession, the light appears to jump back and forth. This "phi phenomenon," as Wertheimer called it, is the basis for traveling light patterns in marquees and neon signs. In apparent movement the nervous system organizes separate stimulation into a pattern. Klein and Schlesinger (1951) arranged to vary the interval between flashes. As the interval is reduced there is a definite threshold where apparent movement appears, beyond which it disappears. The separation of these thresholds, i.e., the range of intervals which permit an impression of movement, is the measure of interest.

Other tests deal with the subject's difficulty in making a successful adaptation, as when he keeps entering the same blind alley in a maze, or in reorganizing perceptual fields on the basis of new cues.


Investigators associated with Wertheimer in Berlin invented the "Water Jar" or Einstellung test (Luchins, 1942). Einstellung may be translated approximately as "mental set" or "orientation." The test makes use of water-jar problems like those of the Stanford-Binet: "If you have a 7-quart jar and a 4-quart jar, how can you get exactly 10 quarts of water?" To follow the logic of the test, call the jars A(7) and B(4). The solution is: Fill A, fill B from A (leaving three quarts in A), empty B, pour the three quarts from A into B, fill A. Thus three quarts are obtained by the rule (A - B), and the additional seven quarts from A; i.e., (A - B) + A = 10. The series of problems in a test might be as follows:


                  Jars            To Be
            A      B      C      Obtained   Rule
Example:    7      4   (none)       10      2A - B
a.         21    127      3        100      B - A - 2C
b.         14    163     25         99      B - A - 2C
c.         18     43     10          5      B - A - 2C
d.          9     42      6         21      B - A - 2C
e.         20     59      4         31      B - A - 2C
f.         23     49      3         20      B - A - 2C or A - C    Critical
g.         15     39      3         18      B - A - 2C or A + C    Critical
h.         28     76      3         25      A - C                  Extinction
i.         18     48      4         22      B - A - 2C or A + C    Critical
j.         14     36      8          6      B - A - 2C or A - C    Critical
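The arithmetic behind the table can be checked mechanically. The following Python sketch is offered only as an illustration: it tests each problem against the practiced rule B - A - 2C and against the two shorter rules A - C and A + C. The function names and the label "set-inducing" for problems a to e are assumptions introduced here, not terms from Luchins.

```python
def set_rule(a, b, c):
    """The practiced 'set' rule: fill B, then pour off A once and C twice."""
    return b - a - 2 * c


def simple_rules(a, b, c):
    """The two shorter rules listed in the table: A - C and A + C."""
    return {"A - C": a - c, "A + C": a + c}


def classify(a, b, c, target):
    """Label a problem by which rules reach the required quantity."""
    set_works = set_rule(a, b, c) == target
    shortcuts = [name for name, value in simple_rules(a, b, c).items()
                 if value == target]
    if set_works and shortcuts:
        return "critical", shortcuts
    if set_works:
        return "set-inducing", shortcuts  # hypothetical label for a-e
    if shortcuts:
        return "extinction", shortcuts
    return "unsolved by these rules", shortcuts


# Problems a-j from the table: (A, B, C, quantity to be obtained).
PROBLEMS = {
    "a": (21, 127, 3, 100),
    "b": (14, 163, 25, 99),
    "c": (18, 43, 10, 5),
    "d": (9, 42, 6, 21),
    "e": (20, 59, 4, 31),
    "f": (23, 49, 3, 20),
    "g": (15, 39, 3, 18),
    "h": (28, 76, 3, 25),
    "i": (18, 48, 4, 22),
    "j": (14, 36, 8, 6),
}

for label, (a, b, c, target) in PROBLEMS.items():
    kind, shortcuts = classify(a, b, c, target)
    print(label, kind, shortcuts)
```

Run against the table, the sketch reports problems a to e as solvable only by the long rule, f, g, i, and j as critical, and h as the one problem where the long rule fails.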


The subject may be given help in solving the first few problems. The long series of problems a to e solved by applying a particular rule builds up a mental set to use that formula. "Critical" and "extinction" problems are then introduced. In a critical problem such as f, the "set" solution works but there is a much easier way to achieve the answer. In the extinction problem h the set solution does not work, but another simple rule can be found by the flexible subject.

To get a good score in the Einstellung test the subject must attend to the immediate problem, discarding memories of the previous solutions. Inflexible performance indicates inability to separate conflicting sources of information. Three other tests relevant to this specific aspect of mental functioning may be described briefly.

The Embedded Figures test (EFT), based on work of Gottschaldt (1926), presents a strange geometric pattern and requires the subject to find it in a more complex field. In some versions, the background is colored irregularly to increase confusion. The score is the time required to solve the problems (Witkin, 1950).

The Stroop Color Word test uses three test sheets. The first sheet consists of color names to be read; the second, of dots whose colors are to be named rapidly. The third and most important sheet again presents color names but this time the words are printed in color. The colors used conflict with the names, the word yellow being printed in red, for example. The subject is required to call off the colors as rapidly as possible. Finally, he is asked to read the words on this sheet. The decline in speed from second to third trial (color naming) and from first to fourth (reading) indicates the degree to which conflicting cues block his thinking.
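The two Color Word comparisons can be written out as a small computation. This Python sketch assumes that each trial is scored as the number of seconds taken and that interference is expressed as the increase in time; the function name, the dictionary layout, and the sample times are hypothetical, and the text does not say which index of "decline in speed" investigators actually used.

```python
def interference_scores(times):
    """Slowing attributable to conflicting cues, in seconds.

    `times` maps trial number to seconds taken (an assumed format):
      1: reading color names, no conflict
      2: naming the colors of dots
      3: naming the ink colors of conflicting color words
      4: reading the conflicting color words
    """
    return {
        "color_naming": times[3] - times[2],  # second vs. third trial
        "reading": times[4] - times[1],       # first vs. fourth trial
    }


# Hypothetical times for one subject.
sample = {1: 40.0, 2: 60.0, 3: 105.0, 4: 45.0}
print(interference_scores(sample))
```

Subtracting the conflict-free trial gives each subject his own baseline, so that a generally slow reader is not mistaken for one who is blocked by conflicting cues.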

FIG. 95. Problems from an Embedded Figures test.

The Rod and Frame test of Witkin (1949) is similar in psychological conception, though radically different in stimulus material. A person ordinarily judges "which way is up?" by combining visual and kinesthetic cues. A "house" in which walls and objects are built at a slant requires the subject to disregard visual cues and rely wholly on bodily cues. In the Witkin experiment in a dark room, the subject is strapped into a fixed position in a chair which can be tilted. He is then asked to judge when an adjustable luminous rod several feet in front of him is in the upright position. The person's success on the rod alone is a measure of kinesthetic acuity. When the rod is placed within a luminous tilted square frame, however, visual cues tempt the subject to call the rod vertical when it is parallel to the sides of the frame. Most subjects judging the rod-in-frame select as vertical a compromise position between the gravitational vertical and the tilt of the frame. The greater the tilt of the chosen position, the less the subject has cast aside the irrelevant cue. A second test of the same nature is Witkin's Tilting Room, where the entire chamber, in which the chair is mounted, can be tilted.
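The compromise between the gravitational vertical and the frame can be expressed numerically. The Python sketch below assumes that each rod setting is recorded in degrees from true vertical, positive in the direction of the frame's tilt. Both function names and the ratio index are illustrative assumptions, not Witkin's published scoring.

```python
def mean_tilt_error(settings_deg):
    """Average absolute deviation of the rod settings from true vertical."""
    return sum(abs(s) for s in settings_deg) / len(settings_deg)


def frame_dependence(settings_deg, frame_tilt_deg):
    """Share of the frame's tilt reproduced in the rod settings.

    0 means the rod was set to the gravitational vertical; 1 means it was
    set parallel to the sides of the frame.
    """
    signed_mean = sum(settings_deg) / len(settings_deg)
    return signed_mean / frame_tilt_deg


# Hypothetical settings (degrees) with the frame tilted 20 degrees.
settings = [4.0, 6.0, 5.0]
print(mean_tilt_error(settings))
print(frame_dependence(settings, 20.0))
```

A setting made with the rod alone could be subtracted first, so that the index reflects the pull of the frame rather than the subject's kinesthetic acuity.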


Validity of Structured Tests 
The performance tests have several common characteristics. Each is employed to measure a stable personality trait. The investigator usually hopes to obtain, from a measurement at one moment in time, information on the individual's general level of persistence, rigidity, prejudice, etc. Performance tests have also been used to study how persistence, for example, varies over time or with changing experimental conditions. In discussing field observations we noted that numerous samples are required to estimate typical behavior, because any particular observation catches the subject only in one of many possible situations. By standardizing the situation, eliminating the random variation from subject to subject, the performance test hopes to make extensive sampling unnecessary.

Motivation is better controlled in the performance test than in any other personality measure. In the field observation the subject is given no directions, simply exhibiting whatever motivation he brings to his own affairs. In EFT, the Circle test of honesty, and in nearly every other performance test, the subject is told that an ability is being measured. This provides him with a clearly defined ideal of behavior; he understands, in each case, how he can earn a high score and understands that a high score is desirable. The "good performance" referred to in the directions is not the performance the investigator is observing; for example, the child who "raises his score" on the Circle test by peeking lowers his honesty score. A performance test of personality tries to standardize motivation to the same degree that motivation in a test of mental ability is standardized. Motivation is not uniform for every subject, but neither is it uniform in a school achievement test or a test for selecting policemen.

Nearly all performance tests contain an ability component which is irrelevant to the trait examined. Some control for level of ability is possible; in the Color Word test the reading rate without interference provides a base for the interference measurement. General reasoning or spatial ability accounts for as much of Embedded Figures performance as does difficulty in handling perceptual interference. Separation of ability from personality factors in problem-solving tests is not easy
and may not be reasonable to attempt. Embedded Figures correlates .92 Hg 
-60 with ability tests such as Block Design, Number Series, and Thurstone$ 
tests of the spatial factor. Yet these are problem-solving tests just because the answer is "hidden." If there were no interfering stimuli there would be no problem.

Do perceptual and cognitive tests measure nothing but general ability? This suspicion is readily dismissed. Although they overlap with ability tests, particularly in samples of college students, they also carry information not predictable from ability tests. An entire battery of intellectual, spatial, and psychomotor tests accounted for only half of the reliable variance in Embedded Figures in one study (Guilford, 1947, pp. 895, 897).


Performance tests are closely linked to specific theories about personality and brain functioning; in this they contrast with most questionnaires and field observations. Hartshorne and May assumed that character consisted of collections of responses or habits and sought to measure those habits by sampling. The perceptual and cognitive tests assume that mental functioning has a definite overall structure and seek to measure specific subprocesses. Since each performance test relates to one narrow element in behavior, it gives little basis for describing the overt personality as a whole. Structured tests of broad traits like cheerfulness and friendliness do not exist and indeed are difficult to imagine.

The most important question about structured tests is their degree of generality. If honesty is a generalized habit, the Circle test will correlate with honesty in many more significant situations. If inability to separate streams of information is a general pattern of behavior, such superficially dissimilar tests as Color Word and Rod and Frame will correlate. And if so, we can expect the same trait to be important in many types of problem solving.

The questionnaire obtains information by means of general questions dealing with average situations, for example: "Do you feel unhappy much of the time?" The hope is that these superficial summary questions will permit inferences to specific situations. The structured performance test, on the other hand, starts with a single artificial situation and hopes that behavior in this situation is largely determined by some fundamental quality of habit, temperament, or brain structure which will influence response in situations having a very different surface appearance. The test is meaningless unless it measures something accurately. It is equally pointless to develop a test if it measures behavior only in this specific task. If it correlates repeatedly with tasks which are on the surface quite dissimilar, it takes on considerable psychological interest. It can then be regarded as measuring a difference between individuals which is in some sense fundamental. Once this is established, a program of validation should be undertaken to determine just how much of socially significant behavior such a score can account for.

The developers of performance tests must study the potential usefulness of their methods in three stages: establishing that the test measures some characteristic dependably (reliability studies), establishing that the characteristic is found in distinctly different tasks or situations (generality), and establishing that the characteristic is related to socially significant aspects of behavior (criterion-oriented validity).
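Each of the three stages comes down to a correlation, and a rough Python sketch may make the sequence concrete. The scores below are invented for illustration, and the split-half design with the Spearman-Brown step-up is one conventional way of estimating dependability that is assumed here rather than prescribed by the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


# Hypothetical scores for six subjects.
odd_items = [3, 5, 4, 6, 2, 5]         # score on odd-numbered items
even_items = [4, 5, 3, 6, 3, 4]        # score on even-numbered items
other_task = [10, 14, 11, 15, 9, 12]   # a superficially dissimilar test
criterion = [2, 4, 3, 4, 1, 3]         # a socially significant rating

total = [a + b for a, b in zip(odd_items, even_items)]

# Stage 1: dependability of the score itself.
print("reliability:", spearman_brown(pearson_r(odd_items, even_items)))
# Stage 2: generality across dissimilar tasks.
print("generality:", pearson_r(total, other_task))
# Stage 3: relation to the criterion.
print("validity:", pearson_r(total, criterion))
```

The order matters: a low value at the first stage limits what the later correlations can show, so a weak generality or validity coefficient may reflect nothing more than an undependable score.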


We cannot point to any one "official" version of a performance test, and we have no manual systematically reporting its technical qualities. Each investigator modifies a test for his own purposes and reports such findings about reliability and validity as emerge from studies of his own theories. For this reason, we can give only illustrative rather than definitive evaluations of the tests we have described.

7. What "ability" is involved in the Rod and Frame test and how may it be
   corrected for?

8. Are the tasks used in performance tests of personality any more
   "artificial" than those used in aptitude tests?

9. Discuss this statement: "A test presented to the subject as a test of
   ability should be regarded as a measure of ability if the subject knows
   what aspect of his performance will actually be scored, and as a measure
   of personality if he does not."

10. The [...] test tends to produce U-shaped distributions rather than
    normal distributions. Is this an advantage or a disadvantage?

Reliability. A coefficient of equivalence for a performance test calls for
correlating different trials or stimulus arrangements on the same occasion.
Some performance tests are quite reliable and others are quite unreliable. A
low coefficient of equivalence may indicate that the proposed test is too
brief to be a good measure, or that there is no common characteristic
running through performance on different items. A high coefficient of
equivalence implies that we are getting an accurate score for the
individual's standing at this time. Table 67 illustrates some reported
coefficients of equivalence. Evidently one can measure many traits with very
satisfactory precision, but reliability cannot be taken for granted. Users
of performance tests often neglect to check reliability, and it is very
likely that unreliability accounts for the failure of many experimenters to
discover significant correlations and differences with such tests.

TABLE 67. Coefficients of Equivalence for Selected Performance Tests

Test                      Scores Compared         Coefficient    Remarks and Source
Rod and Frame             Rod and Frame with      .64, .52       (Witkin, 1949)
                          Tilting Room
Rod and Frame             Eight odd vs. eight     .99, .98       (Gardner et al., unpublished)
                          even trials, corrected
Deception tests                                   .62 to .87     Scores corrected for differences
                                                                 in relevant abilities
                                                                 (Hartshorne and May, 1928)
Critical flicker          Thresholds on six       .98 in each    (Irvine, 1954)
frequency of paretics     trials                  group
and schizophrenics
Embedded Figures          Odd vs. even items,     .68 to .88     Subtests composed of
                          corrected                              more-embedded figures correlate
                                                                 only .35 with easy figures
                                                                 (Gardner et al., unpublished)

Lengthening a performance test to improve reliability of measurement is
sometimes impossible. In the Water Jar test, for example, once the subject
discovers that the solution rule changes for different problems, subsequent
trials are likely to show much less "rigidity." There is no way to restore
the subject's initial naïveté to obtain an equivalent trial. The only way to
extend the sample of behavior is to find a second task which measures the
same quality.
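The "odd vs. even, corrected" entries in Table 67 refer to a split-half
coefficient stepped up to full test length. The sketch below is one way such
a figure could be computed, assuming each subject has a score on every
trial; the function names and the sample scores are illustrative and do not
come from any of the studies cited.

```python
import math

def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either half-score has no variance.
    return cov / math.sqrt(var_x * var_y)

def spearman_brown(r_half):
    """Estimate full-length reliability from a half-length correlation."""
    return 2 * r_half / (1 + r_half)

def split_half(trial_scores):
    """trial_scores holds one list of per-trial scores for each subject."""
    odd = [sum(s[0::2]) for s in trial_scores]   # trials 1, 3, 5, ...
    even = [sum(s[1::2]) for s in trial_scores]  # trials 2, 4, 6, ...
    r_half = pearson(odd, even)
    return r_half, spearman_brown(r_half)

# Hypothetical scores for four subjects on six trials.
scores = [[3, 4, 3, 5, 4, 4], [1, 2, 1, 1, 2, 2],
          [5, 5, 4, 5, 5, 4], [2, 3, 3, 2, 2, 3]]
r_half, r_full = split_half(scores)
print(round(r_half, 2), round(r_full, 2))
```

The step-up assumes the two halves are equivalent samples of the same
behavior, which is exactly the assumption that fails once a Water Jar
subject has caught on to the changing rule.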

If the coefficient of equivalence is large enough to guarantee accurate
measurement, the coefficient of stability tells whether the quality being
measured is a stable one. Stability is desirable when we intend to regard
the trait as a long-standing, generally significant aspect of personality.
[...] mood, temporary inefficiency of thinking, [...]

TABLE 68. Coefficients of Stability for Selected Performance Tests

Test             Interval                  Coefficient    Source
Cheating         Six months                .75-.79        (Hartshorne and May, 1928, II)
Cheating         Early adolescence to      .37            (V. Jones, 1946)
                 adulthood
Rod and Frame    More than one year        .86            (Witkin, 1949)

The coefficients in Table 68 leave no doubt that some performance tests
measure characteristics of considerable permanence.
Correlation with Nonperformance Measures. Positive relations between
performance tests and other types of personality measures are considered
evidence that the tests are getting at personality variables. Pemberton
(1952a) administered several questionnaires to subjects who also took the
Embedded Figures test. There were 29 significant correlations between
self-reports and EFT, indicating that good performers on EFT describe
themselves as follows:

I stay in the background
I am not sensitive to social undercurrents
I am not interested in humanitarian occupations
I do not feel need to apologize for wrong-doing
I am not conventional
I have high theoretical interests and values
I am interested in physical sciences

It appears that in this sample, at least, good EFT performance is associated
with self-centeredness or nonconformity. We cannot judge the strength of
this relationship from Pemberton's report.

L. Ainsworth (1958) correlated rigidity on the Water Jar test with a
questionnaire on insecurity or general life adjustment. For 120 students in
a British university, the correlation was 94. An important subsidiary
finding was that the performance of insecure students changed as the test
was administered with various degrees of emotional pressure. As stress
increased, the most insecure students actually showed greater adaptation on
critical trials than they had shown under minimum stress.
Generality. In one sense, every performance test appears to be highly
specific. Very slight modifications in conditions of administration, scoring
procedure, task, or sample produce significant differences in the meaning of
results. In the Tilting Room, errors have different correlations and
different psychological implications depending on whether the trial begins
with both chair and room tilted to the left or with the two tilted in
opposite directions (Witkin, 1949). This has a theoretical explanation,
since the degree of conflict between cue systems is far greater in the
second case. In the Water Jar test, behavior on critical trials (where the
"set" solution works but is unnecessarily roundabout) has different
psychological properties and correlates from behavior on extinction trials
(where the "set" solution does not work) (L. Ainsworth, 1956).
Most of the performance tests of personality are used in many different
versions. One EFT presents key figures one at a time; another places several
key figures before the subject at once; in a third, the subject is to look
for the same key figure in every item. One may keep the key before the
subject during his search or require him to remember the key figure while
examining the complex figure. We need engineering studies of the promising
tests to determine which form is potentially most valuable. Even where
modification is desired, it appears necessary that the new form and a
standard form be used side by side to evaluate the importance of the change.
No one would think of adopting a modified Wechsler or Binet without
comparing it systematically with the original; yet this tradition is almost
entirely neglected in performance testing of personality.
To summarize the research on generality is next to impossible. On traits
such as "rigidity" there have been a dozen papers summarizing the
correlations, and even the summaries are in disagreement. Some conclude that
there is a general trait of rigidity, some find three or four rigidity
factors (never the same from study to study), and some argue that the
concept of rigidity is invalidated by the data. Levitt (1956) assembled more
than thirty correlations of the Water Jar test with other alleged measures
of rigidity and found that negative results outnumbered significant
correlations three to one. Such a finding must be regarded as evidence that
investigators have been too quick to label a test as a measure of rigidity.
Among the measures which have been claimed to measure rigidity were an
anxiety questionnaire, a questionnaire measure of rigidity, the California F
scale, a Similarities test, and mirror writing of words. To expect all these
measures to agree reflects an unreasonably simple view of mental
organization.

Luchins (1951), though he popularized the Water Jar test, is critical of
those who are content to measure a trait of "rigidity." He regards the test
as an observation of mental process, and would expect its meaning to shift
(as Ainsworth found) under different conditions. Those who seek to measure
an abstract trait underlying the test performance

err in assuming that every Einstellung solution to a test problem is
brought about by the same psychological process, namely, rigidity of
behavior. . . . Moreover, the alleged rigidity in solving the problem is
taken as an indication of rigidity in the respondent's [...] or of rigidity
in his ego-defense system. His behavior is rigid because he possesses
rigidity. One is reminded of the outmoded belief that a thing burns because
it has fire in it. . . . Rigidity of behavior is sought for in the
respondent; it is considered as relatively independent of the field
conditions under which the individual is operating.

. . . I do not think that there is anything inherently [...] in attempting
to determine within a short period of time, a few hours of testing, the
probability that an individual will shift his behavior in life situations in
order to meet changing circumstances. . . . At the present time the most
fruitful approach seems to me to involve intensive observation of and
experimentation with rigidity of behavior under various conditions, if
possible suspending biases as to the nature of the behavior involved. . . .
The aim should be to vary conditions systematically and to observe what
happens. As a final step, and not as a first step as is so common today, one
may be able to propose an explanation for such behavior.
Devoted, painstaking exploration of a single type of test does not appeal
to investigators who want to isolate important individual differences. They
seek some general dimension among several diverse tests. They hope to
identify a quality which is "in the person" yet accounts for behavior over a
wide range of situations and conditions. This line of attack has met
setbacks; one attempt to establish generality after another has failed. Yet
the approach is not entirely without success, and sufficient refinement of
tests and theories has a reasonable chance of identifying such traits. Any
consistent positive correlation between two superficially dissimilar tests
encourages continued effort to define and clarify the trait.

Such correlations do commonly occur, but there are puzzling
inconsistencies. Witkin (1949) finds that EFT correlates with Rod and Frame
.64 for men but only .21 for women. Gardner (unpublished data) finds a
correlation of .65 for women but a near-zero correlation for men. Small
samples, differences in [...], and subtle differences in subject motivation
all contribute to these inconsistencies. One can conclude that performance
tests generally show correlations consistent with the theories offered to
explain them, but the correlations are often low. We are far from having the
reproducible high correlations which would permit us to argue that any two
performance scores measure the same factor of personality.

Each task has its specific elements, and a satisfactory general measure
will have to be built up by combining short trials on various tests, each
having the same common element along with different specifics. The original
Hartshorne-May studies of generality in character support this conclusion.
They were the first to cast doubt upon the assumption that general traits of
behavior can readily be measured by one or two specific samples. Although
the specific tests were reliable, different measures of deception correlated
little with each other. The correlation between cheating on a classroom test
and on the Circle test of coördination was only .50 even after correction
for unreliability. These data contradicted the notion that honesty is a
unified trait which can be measured in any tempting situation. Further, when
different character tests were compared, the notion of a generalized "good
character" fared no better: intercorrelations of honesty, coöperation, and
so on were only about .25. A general "factor" in character has small
influence on any specific behavior (Hartshorne and May, 1930).

Similar results are found for persistence (Thornton, 1939; MacArthur, ...).
MacArthur obtained 21 measures on English schoolboys, all the tests being
presumed to have something to do with persistence. A large proportion of the
intercorrelations were insignificant, though some were in the .50-.60 range.
The general factor accounted for only [...] percent of the performance on
the collection of tests. Among the measures which seemed to be good measures
of "general persistence" were time spent in completing a magic number
square, time spent on a three-dimensional wooden puzzle (Japanese Cross),
and ratings by teachers and peers. Combining eight scores could provide a
measure of general persistence with reliability .79. MacArthur also found
four group factors. One factor linked tests where pupils had a chance to see
if others were still working (in contrast to those where each had to set his
own standards). This factor MacArthur named "social suggestibility in [...]
demanding persistence." Reputation measures formed a second factor. Two more
factors were required to account for persistence in [...] tasks and
persistence in physical tasks. It is evident that persistence is to a large
degree situational.
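The phrase "even after correction for unreliability" refers to the standard
correction for attenuation, which estimates what two measures would
correlate if both were perfectly reliable. A minimal sketch follows; the
function name and the input values are hypothetical, chosen only so that the
corrected value comes out at .50, and are not the Hartshorne-May figures.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores.

    r_xy is the observed correlation between two tests; r_xx and r_yy are
    the reliability coefficients of the two tests.
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: observed r of .40, each test with reliability .80.
print(round(correct_for_attenuation(0.40, 0.80, 0.80), 2))
```

If two tests measured one unified trait, the corrected value should approach
1.00, so a corrected correlation that stays moderate is evidence of
specificity rather than of measurement error.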
It should not be concluded that tests of specific traits have no place. They
are invaluable for many research purposes, and the findings of such research
may have practical significance. Maller (J. McV. Hunt, 1944) tells us that
the Hartshorne-May findings on character led one national organization
working with youth to revise its program completely, because the study
showed that those who had received most recognition in the agency's
character-building activities were on the average most likely to cheat. This
is not hard to explain when we consider that striving for recognition in
competition, and working for high scores even in a puzzle test, may stem
from the same basic feeling of inadequacy.

There are three ways to interpret structured performance tests:

• They can be regarded as specific measures of one type of performance,
defined only by the operations used in measuring. When they are used as
dependent variables in psychological experiments, any positive results are
likely to be of ultimate theoretical importance even though they cannot at
present be interpreted in terms of general attributes.

• They can be used to measure general traits of personality. But because
of the low correlations among tests of the same supposed trait, such
measurement requires a composite of diverse tasks. Performance testing of
any personality construct seems to require a "hodgepodge" such as Binet
invented to measure intelligence when no single type of problem proved
adequate.

• They can be used singly or in combination to predict practically
important variables. Any significant findings (see below) would be important
whether or not the tests could be interpreted in terms of constructs.
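The need for a composite of diverse tasks can be made concrete with the
generalized Spearman-Brown formula. The sketch below assumes, as a
simplification, that every pair of tasks intercorrelates at the same average
value; the .25 used in the example is the intercorrelation reported in the
text for character tests, while the function names and the .80 target are
illustrative.

```python
import math

def composite_reliability(k, mean_r):
    """Reliability of a sum of k tasks whose average intercorrelation is mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

def tasks_needed(target, mean_r):
    """Smallest number of tasks whose composite reaches the target reliability."""
    k = target * (1 - mean_r) / (mean_r * (1 - target))
    return math.ceil(k - 1e-9)  # small tolerance guards against float round-off

print(round(composite_reliability(8, 0.25), 2))  # eight tasks at r = .25
print(tasks_needed(0.80, 0.25))                  # tasks required for .80
```

Under these assumptions a single task correlating .25 with its fellows is a
poor measure of the general trait, and something like a dozen such tasks are
required before the composite becomes dependable.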
11. Analyze the story-completion test of persistence, identifying all the
    factors which might cause one fifth-grader to earn a higher score than
    another.

12. According to Studies in Deceit (Hartshorne and May, 1928), children from
    homes of low socioeconomic status cheat more on achievement tests than
    other children (r = [...]). What factors must be taken into account
    before concluding that these children are more likely to violate
    standards of good conduct?

Practical Correlates. Comparisons with socially important criteria have
been unsystematic. As comprehensive as any evaluation program has been the
work of the Air Force, whose conclusions are summarized by Melton (1947,
pp. 848-849):

A continuing effort was made during the Aviation Psychology Program to
obtain a test of the reaction of the candidate to emotion-producing stimuli,
either directly through the application of such stimuli as distractions
during the course of performance on some psychomotor task or indirectly
through the measurement of muscular tension or other psychophysiological
variables. . . . The available data do not support the hypothesis that
additional validity for the prediction of success in elementary pilot
training accrues to a test situation when verbal threats and other
distractions, including presumably fear-producing stimuli, are administered.

Though the Operational Stress Test had validities of .20-.30, it overlapped
so much with ability tests that it made no useful contribution to
prediction.


Structured tests have been widely applied in clinical research, and many
studies demonstrate differences between patients and normals or among
patients of different types. Such results are difficult to interpret;
diagnostic categories have uncertain psychological significance, and results
can often be attributed to differences in coöperation and attention rather
than to more fundamental psychological processes. Burdock, Sutton, and Zubin
(1958) summarize much tentative evidence for various positive relations; for
example, discharge of schizophrenics from hospital is predicted by low
flicker-[...] and good performance on the Color Word test. Confirmation of
these results would have obvious practical importance and would give a basis
for theories about qualities which predispose to recovery. According to
Burdock and his colleagues, previous applications of performance tests have
picked variables too unsystematically, have not distinguished conceptual
from perceptual performances, and have failed to compare complex
performances against "baseline" measures of physiological and neurological
functioning.

Performance measures of personality are related to social-psychological
variables. Linton (1955), for example, measured attitudes before and after
reading a biased but allegedly authoritative article. An index based on the
Rod and Frame and Embedded Figures tests together correlated .66 with
change scores. Subjects who could not disentangle relevant from irrelevant
stimuli were most easily persuaded.

Significant predictions of resistance to propaganda or of recovery from
schizophrenia are illustrative of the many studies which encourage hope
for practical application of performance tests. None of the relationships,
however, is well confirmed. Few correlations are checked by repeat studies,
and not infrequently a repeat study fails to confirm the initial finding. We
can find no correlation between a performance test and a practical criterion
that is at present well enough established to warrant basing individual or
administrative decisions on the test.


OBSERVATION OF COMPLEX PERFORMANCE


We turn from the highly structured measures of single traits to tests which
assess style of performance in relatively complex tasks.


Problem Solving


During an ability test such as Block Design one can observe method of
attack and response to frustration. Better information is obtained by
modifying the test and by specifying precisely what is to be observed.
Goldner (1957) defined "method of attack" in terms of two more definite
variables: "whole-part approach" and rigidity. He used six tests: a modified
Block Design test; the Arthur Stencil Design test, in which cutouts of
various colors must be superimposed to form a specified pattern; Anagrams I,
in which the subject builds numerous words from a set of letters; Anagrams
II, requiring identification of a scrambled ten-letter word; the Rorschach
inkblots; and a "Function test." In the last named, the subject is asked
"What are the possible different uses of (box, broom, paper)?" Goldner
developed scoring rules for each test. For the Function test, the whole-part
score was assigned according to whether the answer used the whole object
("Put things in the box") or broke it into parts ("Use as firewood"). In the
Block Design test, a "whole" attack is shown by the person who turns each
block to the correct face before beginning and then assembles the pattern as
a unit, paying attention to symmetry, etc. A "part" approach is shown by the
person who starts at one corner and adds one block at a time, building up
the pattern piecemeal. To bring out differences in approach, Goldner made
three changes from the original Kohs test. He used irregular, nonsquare
designs so as to make analysis of the pattern more difficult. He presented
every design, whether it used nine or sixteen blocks, in the same size, so
that the subject had to decide how many blocks to use.

Goldner also judged rigidity. In Block Design, for example, rigidity was
scored if the subject had difficulty in judging the correct number of
blocks, retained the same attack after a failure, or gave up without
finishing a problem. In the Function test, rigidity was identified with a
tendency to give many logically similar uses ("as a tool box," "for
mailing," "to pack things"), whereas flexibility was identified with
variety.

This technique contrasts with the Water Jar test used as a measure of
rigidity. In that simpler task the score reports a particular countable
symptom. In Goldner's battery each score is a rating of an assumed mental
process which can appear in the performance in several different ways. The
more complex task "spreads out" performance so that mental process can be
inferred.
Goldner found substantial support for the hypothesis of generality of the 
two traits observed. The results above the diagonal in Table 69 show that 


TABLE 69. Generality Among Tests of Problem-Solving Style 


Total 
"Whole- 
Ana- Ana- Part" Score 
i Block Minus Par- 
Func- grams grams Stencil c V 
ticular Test 
a A LS Design ticular Tes 
l 40 58 40 67 
une $ —02 —02 10 08 
die 53 36 48 
P id " —03 27 
SnSgrams Il 0 25 o = 
Block Design 17 32 -1 
ock Design 26 16 00 
sa Rigidity score 
inus " 
je Pme i a 43 06 30 58 54 
NOTE: Correlations above diagonal are for whole-part scores, those below
diagonal for rigidity. Correlations in boldface are significant.
SOURCE: Goldner, 1957, p. 14-


five of his six "whole-part" scores are correlated. Each test agrees
substantially with the total of the other measures. Particularly striking is
the correlation between tasks as dissimilar as Anagrams I and Stencil
Design. The Function test is an exception; whole approach on this test
either is unreliable or is a different trait from whole approach on the
other tasks. Goldner's findings on rigidity are quite similar: five tasks
have marked correlations with each other. This time, however, Anagrams I is
the exception.

Concept-formation tests have been specially designed for assessing
abnormalities of thought processes. The Hanfmann-Kasanin test may be taken
as an example. Twenty-two wooden blocks of several colors, shapes, heights,
and sizes are placed on the table. On the hidden underside of each is
printed a nonsense syllable. The syllable defines a type of block (all mur
blocks are small and tall). The examiner tells the subject that the blocks
are of four different kinds, one of which is named mur, and that he is to
find the basis for classification and sort the blocks. The subject may
proceed as he likes to discover the classification, except that he may not
invert a block to look at the name. After each trial sorting, the examiner
points out an error and asks for another trial. This goes on until the
problem is solved and the principle of classification is stated. Observation
shows whether the subject uses a logical hypothesis ("perhaps all mur are
triangles"), an illogical hypothesis, or pure guessing. One observes ability
to profit from correction, ability to discard a false set and form a new
concept, bizarre [...] and verbalizations, and so on. More significant than
the score is the insight and conceptual thinking displayed. The theory is
that schizophrenics are unable to think abstractly and must respond to each
object in the environment as a separate thing.

Though clinical groups differ on conceptual tests, they are not infallible
diagnostic indicators. Some brain-injured patients show normal concept
formation; and tests of either schizophrenics or [...] patients with low
ability may easily be misinterpreted as indicative of brain damage
(Zangwill, in Buros, 1949, p. 79). The designers of such tests argue for
impressionistic analysis as the only dependable method of interpretation,
but reviewers conclude that the advantages of the tests could be retained in
a strictly objective scoring of processes observed (A. J. Yates, 1[...]).


13. Which of Goldner's tests are most structured? Do they correlate more
    highly with each other than with unstructured tests?

Perception


Perceptual tests examine how the subject takes in, arranges, and reports
information. One such test is the Bender Visual Motor Gestalt Test (or
Bender-Gestalt). The Bender, like many other tests we have mentioned, grows
out of Gestaltist research on perception (Bender, 1938). Figures with
different patterns of organization, including those in Figure 96, are shown.
The tester asks the subject to copy the set, and observes his mode of attack
and his success.

FIG. 96. Bender-Gestalt patterns to be copied. (Copyright 1946, American
Orthopsychiatric Association. Reproduced by permission of Dr. Lauretta
Bender and the American Orthopsychiatric Association.)

Each structured personality test was pointed specifically toward measuring
some single trait. Goldner, using more complex tests, was able to score
each one for two qualities, approach and rigidity. The Bender cannot be
characterized as a measure of any one trait or set of traits. Responses of
individuals may differ in a hundred different ways, which the tester
attempts to observe, collate, and interpret. Scoring rules have been
developed (Pascal and Suttell, 1951), but the clinician generally attempts a
qualitative interpretation (see Chapter 19).

Performance may be treated statistically by observing "signs" which
characterize some criterion group. Gobetz (1953) listed behaviors which
distinguished neurotics from normals, for example:

[...] slope in reproducing rows of dots
[...] in number of wave crests
Counting aloud during reproduction
Figures crowded into half of page
When Gobetz counted the number of such signs he found a significant
difference: 19 percent of a cross-validation group of neurotics showed [...]
or more signs, compared to 4 percent of normals. Many of the signs proposed
by other authors as characteristic of neurotics did not differentiate in
this study. Gobetz concluded that the test could be helpful in locating
emotionally disturbed persons, provided that other data were used to confirm
the diagnosis of maladjustment.
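Gobetz' caution about confirming the diagnosis can be illustrated with a
short calculation. The 19 percent and 4 percent figures are those reported
above; the proportion of neurotics in the group being screened is not given
in the text, so the 10 percent used below is purely an assumption, as is the
function name.

```python
def proportion_correctly_flagged(hit_rate, false_alarm_rate, base_rate):
    """Share of persons flagged by the signs who actually belong to the
    criterion group, for a given base rate of that group."""
    true_flags = hit_rate * base_rate
    false_flags = false_alarm_rate * (1 - base_rate)
    return true_flags / (true_flags + false_flags)

# 19% of neurotics and 4% of normals reach the cutoff (from the text);
# the 10% base rate is an assumed value for illustration only.
share = proportion_correctly_flagged(0.19, 0.04, 0.10)
print(round(share, 2))
```

With these inputs only about a third of the persons flagged would be
neurotic, and four neurotics in five would pass unflagged, which is why the
sign count can serve as a lead for further study but not as a diagnosis.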


14. Judging from Gobetz' signs, what traits does the Bender measure? 


15. Compare the screening effectiveness of Gobetz' scoring system with that
    for the MMPI (p. 479).


The Rorschach Test. The Bender is a test of mental efficiency. The subject
is set a simple, objective, and literal task; any emotional disturbance of
mental processes can impair the performance. In the Rorschach test, by
contrast, [...] plays a much smaller role in guiding responses. The subject
is asked to tell what he sees in ten inkblots, blots whose form is so
irregular as to permit innumerable interpretations. The blots are calculated
to arouse emotional response with their bloody reds, ominous blacks, and
luminous grays, and with their forms [...].

The technique was invented by Hermann Rorschach, a Swiss psychiatrist. He
used the blots for "an experimental study of form perception" and found that
patients of different types had different ways of responding to the blots.
His diagnostic method, published in 1921, has been elaborated by subsequent
investigators, with a shift of emphasis from attempted psychiatric
classification to a description of psychodynamics. S. J. Beck (1945, 1952)
has made minor changes in Rorschach's system and has provided norms; a more
radical revision of the scoring system has been made by Bruno Klopfer
(Klopfer et al., 1954).

The Rorschach is used very extensively in clinical testing, even though its
value is seriously questioned by nonclinical psychologists and [...]
clinicians. The Rorschach became prominent because of the great interest in
the descriptions of personality it yields. In the 1940's, when psychologists
were first used in large numbers in mental hospitals and treatment centers,
the chief duty assigned them was the administration of intelligence tests
and Rorschachs. As a result, training in Rorschach interpretation became a
requirement for clinical psychologists.

The interpretation begins with a systematic scoring according to fairly
objective rules. There are about a dozen major scores, which fall into three
major categories: location, determinants, and content. Location scores
indicate whether the response uses the whole blot (W), commonly perceived
subdivisions (D), or unusual details (Dd). The "determinants" are the shape,
color, and shading of the blot which the subject takes into account.
"Movement (M)," for example, is scored when the subject describes humans in
motion, and CF when the response depends on both form and color, with color
the more significant in determining the response. The scorer also notes how
well the response fits the form of the blot, scoring form quality + or -.
Finally, the content score notes whether the response refers to persons (H),
parts of persons (Hd), clothing (Cg), etc.
The scoring of four responses will illustrate the procedure. Card X is a
mixture of brightly colored forms. Suppose that these four responses are
given:

1. A big splashy print design for a summer dress.
2. Enlarged photograph of a snowflake [refers to a large irregularly shaped
   area].
3. Two little boys blowing bubbles. You just see them from the waist up.
4. Head of a rabbit.

The scoring of these responses (based on supplementary explanation obtained
by inquiry) is:

                          Determinant and
Response    Location      Form Level         Content
1           W             CF+                Art, Cg
2           D             F-                 Nature
3           D             M+                 Hd
4           D             F+                 Ad

Norms can be collected for Rorschach scores, but to be meaningful the norms
must be based on subgroups of the population rather than people in general.
The procedure for testing can also be standardized. Interpretation, however,
has never been reduced to a systematic procedure.

The quality of responses indicates something both about the subject's
intellectual level and about the effort and carefulness he puts into an
intellectual task. Much is made of the subject's control over his impulses
and his emotional reactions. In Rorschach interpretation, movement responses
are thought to represent imagination and creative impulses arising from
within, and color responses are thought to represent emotional reactions to
external stimuli. "Form" is equated with ability to take reality into
account. A person who harmonizes form and movement is said to accept and use
constructively his inner impulses; a person who rarely reports a movement
response is regarded as lacking in imagination or as repressing it. Hertz
(1942) expresses a view shared by most specialists in Rorschach method:
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In the final analysis the procedure of the interpretation in terms : 
other clinical and test data defies standardization, as Rorschach n 
nally contended. The information gleaned from the Rorschach Ege : 
is projected against family background, education, training, 1 E 
tory, past life, qualitative judgments of the examiner and of ot ner P d 
ple, and other clinical and test data. This is then interpreted in dont i. 
the examiner's experimental knowledge of the dynamics of huma ed 
havior. Final conclusions are made by inference and analogy piis : 
upon the experience, ingenuity, the fertility of insight, and, peter 
forgotten, the common sense of the examiner. Prolonged and ex hal 
experience is necessary, not only with human personality but K i 
kinds of clinical problems. This last step by definition, therefore, i and 
sonal to the examiner and subjective in view. It permits no norms, 
it eludes all standardization. 


The impressionistic interpretation is constructed through the use of a great number of interrelated hypotheses about the internal forces and controls which lead to each type of response. Each of these hypotheses must ultimately be verified to make interpretation trustworthy. Successful experience with individual cases is the chief basis on which users defend the Rorschach method. There has been considerable formal research on the hypotheses, but the complexity of the problems posed by the tests makes comprehensive research exceedingly difficult (Cronbach, 1949; Ainsworth, in Klopfer et al., 1954, pp. 405-500). Sometimes the evidence is strikingly favorable.


Rorschach (1921; see 1942, p. 7) said that movement responses are indicative of personalities that "function more in the intellectual sphere, whose interests gravitate more towards their intrapsychic living rather than the world outside themselves." This introversion interpretation was checked by Barron (1955), who employed a psychophysical technique to obtain an M score, like Rorschach's but much more reliable. This score was compared with ratings made by clinical assessors using other data. The persons with strong M tendencies were described as inventive, having wide interests, introspective, concerned with self as object, valuing cognitive matters; low M subjects were described as practical, stubborn, preferring action to contemplation, inflexible in thought and action. Although this supports Rorschach's interpretation, Klopfer's use of M as a prime indicator of intelligence is questioned; there was no correlation between Barron's M score and objective tests of intelligence and originality. To be sure, the psychological assessors rated the high M's as more intelligent, but in view of the objective test results this implies that assessors are biased toward judging persons intelligent if they appear "thoughtful." In another study, similar mixed confirmation of Rorschach theory is found. S responses, which interpret the white space between the blots, are presumed to indicate oppositional tendencies, and Bandura (1954) found a correlation of .35 between the S score and ratings of negativism. No support was found, however, for the configural hypothesis that the meaning of this oppositional tendency depends on the M:C balance, S in high M subjects implying self-criticism, and in high C subjects implying opposition to others. (See also D. C. Murray, 1957.)*

Hundreds of additional studies could be cited, each dealing with one bit of Rorschach theory. The trend of the results (Benton, 1950; Holtzman et al., 1954; Sarason, 1954) is this:

• About half of the experimental tests of Rorschach hypotheses give results consistent with clinical theory. The interpretation certainly has "validity greater than chance."

• These confirmations indicate rather small degrees of relationship between Rorschach indicators and postulated traits. (Bandura's correlation of .35 is typical.) Many different personality factors and abilities influence any one score, and no direct trait interpretation can be made with confidence.

• Some aspects of the theory are definitely incorrect and should be revised.

Just how adequate the test, with these limitations, is for global impressionistic assessment will be considered in Chapter 19.

There have been attempts to use single quantitative scores from the Rorschach either as trait measures or as empirical predictors. Sets of "signs" have been developed, for example, to identify persons with organic brain damage (A. J. Yates, 1954; Fisher, Gonda, and Little, 1954). Generally, these proposed special formulas prove valueless on cross-validation, either showing no validity or having too high a false positive rate to be useful. Trait scores such as Goldner's measures of approach and rigidity sometimes correlate with external criteria, but the correlations are generally too small to warrant use of the Rorschach as a quantitative measure. Some investigators feel that these deficiencies could be removed by redesigning the test in order to obtain a better sample of behavior. Holtzman (1958) has prepared two parallel sets of 45 blots and developed a scoring system for them. The subject gives one response to each blot. Score reliability is expected to surpass that of the conventional ten-blot test, but validity remains to be determined.

* A [...] of the disagreement [...] studies [...] (1959) on "the social psychology of Rorschach validation." When university psychologists [...] interpretations of the Rorschach, their predictions are [...] 70 percent of the time [...] only 50 percent of the [...] studies by psychologists working [...] validity, however, they [...] one-third of their attempts to validate the test against [...].
16. Why might a "perfect" accuracy score, 100 percent F+, indicate a personality pattern undesirable for many situations?
17. Do the Rorschach scores related to quality of output reveal maximum ability or typical behavior? How does the Rorschach compare with the Binet test in that respect?
18. "Responses to ten inkblots, presented by one tester on one [...] constitute too small a sample of behavior to measure any intellectual or [...] reliably." Do you agree? To what extent does Holtzman's test overcome this objection?
19. It is pointed out (p. 131) that lengthening a test improves validity only if the score is a pure measure of the quality measured by the criterion. Considering this principle, would increasing the sample of inkblot responses in Holtzman's manner be expected to have much or little effect on validity?


Group Behavior

Group Discussion. The Leaderless Group Discussion (LGD) is a [...] observation procedure used to study social behavior (Bion, 19[...]). A group of persons, perhaps applying for the same job, are told to discuss a certain problem (e.g., how to increase movie attendance). Observers rate predetermined aspects of each member's performance. The LGD is unstructured: no rules of procedure are established, the topic is left largely undefined, and the group, b[...] [...]ship or dominance relations. During the discussion, however, social patterns [...] the person plays is presumably similar to [...] adopt in natural groups.

[...] commonly rated have to do with three [...]: prominence [...], [...] (efficiency, suggesting useful ideas), and [...]. [...] measures prominence by rating the following behaviors [...] scale from "a great deal" to "not at all"):

    showed initiative
    was effective [...]

[...] evaluated in several ways. Stability over trials is fairly high; with a week between tests, the correlations range from .75 to .90. Over longer time intervals or with radical changes in type of problem, correlations drop to about .50. The test is measuring some consistent and general aspect of personality. Behavior in practical situations is no doubt determined by many forces other than personality (seniority, relative prestige, specifically relevant knowledge, etc.), but LGD scores nonetheless have striking predictive value. Bass and Coates (1952) compared LGD scores with ratings by superiors given as much as nine months later and found correlations of .40 to .45. Arbous (1955) reports a validity of .60 for LGD against rated promise of executives in training. Suitability for the British foreign service as rated after two years on duty was predicted (validity .83) by LGD scores at the time of selection (Vernon, 1950).
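
One conventional way to read validity coefficients of this size is to square them, which gives the proportion of variance in the criterion ratings that is associated with the LGD score. The sketch below is an illustrative aid and not an analysis reported by the studies cited; the `variance_explained` helper and the dictionary labels are assumptions introduced here.

```python
# Validity coefficients quoted in the paragraph above (labels are shorthand).
VALIDITIES = {
    "Bass and Coates (1952), low end": 0.40,
    "Bass and Coates (1952), high end": 0.45,
    "Arbous (1955)": 0.60,
}


def variance_explained(r):
    """Proportion of criterion variance associated with the predictor (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


for label, r in VALIDITIES.items():
    print(f"{label}: r = {r:.2f}, r^2 = {variance_explained(r):.2f}")
```

On this reading a validity of .40 corresponds to about 16 percent of criterion variance and a validity of .60 to about 36 percent, which is consistent with the remark that many forces other than personality shape behavior in practical situations.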

The LGD procedure illustrates the advantage that can be obtained from systematic observations. Social relations are important in personnel assignment, yet it is very difficult to judge validity from questionnaires, letters of recommendation, or interviews. The LGD is an economical "worksample" of group behavior. By scoring observed behavior it avoids much of the bias inherent in summary impressions. Army colonels' ratings of cadet potential were much poorer predictors of later merit ratings than were total scores recorded by these same colonels acting as observers for an LGD session (Bass, 1954).


20. Give reasons for each of the following recommendations by Bass regarding LGD technique:
a. Counts of actual behavior (e.g., new approaches suggested) should be 
substituted for ratings of the subject's tendency to suggest new approaches. 
b. Problems should be equally ambiguous to all participants. 
c. Examinees tested in a group should all have the same rank. 
21. Compare LGD and peer ratings as methods of assessing leadership potential. 
Task Leadership. The Leaderless Group Discussion is one of a number of "worksample" techniques for measuring personality which originated in German and British military psychology. Psychologists selecting officers thought it necessary to observe complex behavior combining intellect, emotion, and habit. One simple team-performance task devised by the Germans uses two pairs of shears linked by rods so that they must move in unison. While one shear is opening, the other is closing. Each subject operates one pair of shears, cutting a series of increasingly complex patterns from a sheet of paper. The shears are so arranged that if one man goes directly and forcefully at his task, the shears of the other man move in a rhythm which makes accurate cutting almost impossible. By means of observation, automatic recording, and inspection of the product, the tester looks for evidence of initiative, dominance, and cooperation (Kunze, 1931). In a group leadership test used by OSS, the American wartime intelligence service, candidates were directed to move a heavy eight-foot log, and themselves, over two walls ten feet high, eight feet apart, and separated by an imaginary bottomless chasm. Observers noted which men took initiative and leadership, how they directed others, how they accepted orders, and so on.
Perhaps the high point of fiendish ingenuity was the [...] construction test. The subject is assigned to build a five-foot cube with [...] Tinkertoys. Poles and spools must be fitted together, and since the task is too large to be managed by one man, two helpers are [...]. [...] directions, the tester ostentatiously clicks his stop watch and [...]. What the subject does not know is that his helpers are highly [...] bums. Kippy is negative, indolent, a drawback. Buster is an [...], ready to do all manner of things, mostly wrong, and also [...] the candidate with personal criticism. This is reported as a typical dialog (Anon., 1946):

Candidate: Well, let's get going.
Buster: What is it you want done, exactly? What do I do [...]?
Candidate: Well, first put some corners together; let's see, make eight of these corners and be sure you pin them like this one.
Buster: You mean we both make eight corners or just one of us?
Candidate: You each make four of these, and hurry.
Kippy: Whacha in, the Navy? You look like one of them curly-headed Navy boys all the girls are after.
Candidate: Er, no, I'm not in anything.
Kippy: Just a draft dodger, eh?
Candidate: Let's have less talk and more work. You build a square [...] here and you build one over there.
Kippy: Who are you talking to, him or me? Why don't you give us a number or something, call one of us number one and the other number two?
Candidate: I'm sorry. What's your name?
Buster: Mine's Buster and his is Kippy. What's yours?
Candidate: You can call me Slim.
Buster: Not with that shining head of yours. What do they call you, Baldy or Curly? Did you ever think of wearing a toupee?
Slim: Come on, get to work.
Kippy: He's sensitive about being bald.
Slim: Just let's get this thing finished. We haven't much more time. Hey, there, you, be careful. You knocked that pole out deliberately.
Kippy: Who me? Now listen here [...]

[...]

Kippy and Buster are psychologists and are in a position to make an excellent report on the man's reaction. (The fact that they had served [...] privates, and that some of the candidates they were privileged to [...] were generals being considered for special assignment, probably also [...] untouchable record for job satisfaction among psychologists.)

22. A field performance test of an NCO's ability to lead his squad was developed by the Army as a criterion measure of proficiency. Why are some performance tests regarded as measures of personality and some as measures of proficiency?
THEMATIC PROJECTIVE TECHNIQUES


The Bender-Gestalt and the Rorschach illustrate one type of projective technique, in which the subject's style of handling a problem is the focus of [...]. These primarily stylistic tests may be contrasted with thematic tests, in which the interpreter is especially concerned with the content of the subject's [...]. [...] resembles that between [...] focus on response patterns [...] techniques for studying stimulus meanings such as the Semantic Differential and the Rep test. The stylistic and thematic categories are not mutually exclusive. One can identify specific fears or obsessions in the Rorschach protocol, and can even interpret "content" of Bender reproductions through Freudian symbolism. Conversely, mental style is observed in the Thematic Apperception Test. But the stylistic tests generally yield richer stylistic information than the thematic tests and are rather poor sources of thematic information. The thematic test comes nearer to examining "the whole person" at once than any other testing technique, seeking information on emotions, attitudes, and cognitive processes, so that one does give a comprehensive, if tentative, portrait of the whole personality.

FIG. 97. Cartoon stimulus for a thematic test. The picture is presented with the statement "Here is Blacky with Mama's collar." (From the Blacky Pictures by G. S. Blum. Copyright 1950, The Psychological Corporation. Reproduced by permission.)

The Thematic Apperception Test

The Thematic Apperception Test of H. A. Murray and his coworkers requires the subject to interpret a picture by telling a story: what is happening, what led up to the scene, and what will be the outcome. The responses are dictated by the constructs, experiences, conflicts, and wishes of the subject. Essentially the person projects himself into the scene, identifying with a character just as he vicariously takes the place of the actor when he sees a movie. The TAT consists of twenty pictures, different pictures being used for men and women. Since two one-hour sessions are required for the full test, investigators often use shortened versions. The subject is led to believe that his imagination is being tested. The interpreter gives particular attention to the themes behind the plots. The stories may indicate a defeatist attitude, concern about overbearing authority figures, or preoccupation with sex. In addition to these aspects of response content, the interpreter considers the style: use of the whole picture rather than piecemeal [...], fluency, concern with accuracy in fitting the story to the picture, [...].

The interpreter looks at each story in turn, deriving hypotheses from the plot, the symbolism, and the style. The hypothesis from one story (e.g., "this man represses all hostile feelings") is checked against subsequent stories. The interpreter must decide how much weight to give to each of the conflicting indications and must integrate the information on intellectual powers, emotional conflicts, and defense mechanisms indicated by the test protocol.

Only a few illustrations of the analysis can be given here. Card 1 of the TAT shows a boy, perhaps 10 years old, looking at a violin lying on a flat surface. A girl, age 14, with a Binet IQ of 148, gives this story (Henry, 1956, p. 111):

Right now the boy is looking at the violin. It looks like he might be kind of sad or mad because he has to play. Before he might have played ball with the boys and his mother wouldn't let him. He had to go in and play. Looks like he might practice for a little while and then sneak out.

Henry, working from this and other stories, estimated her IQ at 140, commenting on how clearly the story "takes into account the basic stimulus demands of the picture" and goes on to "entirely relevant elaborations of good quality . . . [which] attribute motive and action to the characters." Whereas this story led more to a study of process than of plot, the story of a 42-year-old clerk is interpreted thematically (Henry, p. 145):

The story behind this is that this is the son of a very well-known, a very good musician and the father has probably died. The only thing the son has left is this violin which is undoubtedly a very good one and to the son, the violin is the father and the son sits there daydreaming of the time that he will understand the music and interpret it on the violin that his father had played.

Henry comments that the [...] and a conviction [...] ambition [...] reflecting [...] possibility [...].

A young boy sitting [...] It is not clear in the [...]

Note, says Henry, the emphasis on conflicting alternatives: glorification or disgust, has to take and doesn't want. This personality "may well be marked by its attraction to opposites." The core of conflict appears to be sexual, the basic issue being whether woman can be "both the Madonna and the sexual object. . . . This is an instance of the use of the violin as a sexual symbol [...]. The man is basically preoccupied with some strong emotional issue; hence [...] utilizes form details in a distorting manner [e.g. "violin spread out"]. . . . He feels impelled to make a formal heterosexual adjustment as well as a conventional social adjustment, even though both are somewhat forced and against his will."

These excerpts by no means represent the intricacy of a full interpretation in which stories are compared with each other and with background information about the subject. For examples of such full interpretations the reader is referred to Henry (1956) and Shneidman (1951). We should also emphasize that such interpretations as Henry makes are, if the psychologist is properly trained, extremely tentative, and are discarded unless there is supporting evidence elsewhere in the test and the subject's history. These illustrations do indicate the individuality of style which TAT responses exhibit, and the variation in the interpreter's attack. At one moment he views the performance entirely as an intellectual effort; at another he treats the response as a symbolization of unconscious conflict. How he interprets each response depends upon the story and perhaps upon his own artistic impulses of the moment.

Though interpretation has been primarily qualitative and impressionistic, it is possible to develop objective scoring systems for the TAT. There are dozens of common variables whose strength can be observed in almost every TAT performance: perception of authority, reaction to extremely difficult tasks, originality, reliance on luck and magical intervention. The themes themselves are often highly individualistic, but common elements can be tabulated. Shneidman (1951) presents fifteen TAT scoring systems used by various clinicians. Such scoring reports the percentage of stories whose outcome is unhappy, the number of female characters seen as predatory or demanding, etc. These scores play a larger part in research than in clinical analysis of individuals. Use of TAT scores for diagnosis appears worthy of further exploration. Dana (1955, 1956) developed four scores for expressive aspects of the performance which separated neurotics, psychotics, and normals. Mussen and Naylor (1954) validated an aggression score for TAT stories, showing that it correlated with frequency of overt aggressive behavior in problem boys. More than that, when the frequency of mention of punishment in the TAT was used as a measure of fear of punishment, it was shown that behavior depended on both aggressive drive and fear. Every one of the seven boys with high TAT aggression and low fear of punishment showed high behavioral aggression; only two out of nine with high TAT aggression and high fear of punishment were overtly aggressive. This is an example of the oft-reiterated principle that no one personality score is fully interpretable by itself.
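
The contrast between the two groups of high-aggression boys (seven of seven overtly aggressive with low fear, two of nine with high fear) can be checked against chance with a one-sided Fisher exact test. The sketch below is an illustration added here; the source does not say what analysis Mussen and Naylor used, and the function name `fisher_one_sided` is an assumption of this example.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability, with row and column totals held fixed, of a
    top-left count at least as large as the observed value a.
    """
    row1 = a + b            # size of the first group
    col1 = a + c            # total number showing the behavior
    n = a + b + c + d       # all subjects
    upper = min(row1, col1)
    total = comb(n, row1)
    favorable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return favorable / total


# Rows: low fear of punishment, high fear of punishment.
# Columns: overtly aggressive, not overtly aggressive.
p = fisher_one_sided(7, 0, 2, 7)
print(f"one-sided p = {p:.4f}")
```

With only sixteen boys the test rests on very few cases, but a split this extreme would seldom arise if fear of punishment were unrelated to overt aggression among boys with high TAT aggression.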

Stability coefficients over two months are in the range .60-.90 for such scores as need for abasement, giving stories with positive outcomes, and use of tension-relief words. Though the evidence is scanty, this is a very favorable indication of the possibility of accurate measurement, since the strength of needs measured by TAT changes somewhat from occasion to occasion and consistency cannot be perfect (Crandall, 1951; Lindzey and Herman, 1955). From these and other studies, it appears that the TAT collects sufficient information to permit fairly accurate scoring of traits, if scoring keys are carefully developed toward this end.


23. How many traits are mentioned in Henry's three interpretations?
24. Can one regard the frequency of punishment by authority in TAT stories as a sample of behavior indicating how often the subject is punished in life?


Measurement of Need for Achievement


The TAT is designed to cover the whole range of ideas and behavior and therefore cannot cover any one topic thoroughly. While a person [...] with independence conflicts may bring them into every story, most [...] reveal their relationships with [...] designed to elicit such stories. [...] answer a specific question [...].

Focused tests are designed to elicit thematic responses all of which bear on the same question. For example, Murphy and Likert (1938) carried out research on labor-management conflict by presenting pictures of [...] conflict with police, etc. Shapiro, Biber, and Minuchin (1957) tested teachers' attitudes by presenting cartoon pictures of classroom scenes and [...]. [...] Air Force personnel was based on the hypothesis [...] aggression would be associated with tolerance [...] ([...]man et al., 1957).

The possibilities of the focused thematic test have been most thoroughly [...] by McClelland and his associates (1958). He selected four pictures ([...] from the TAT) intended to bring out attitudes toward achievement. The four cards are a work situation (men at a machine), a study situation ([...] at desk with book), a father-son picture, and a boy apparently daydreaming. Achievement motivation was scored in every story suggesting competition with a standard. For example, [...] "[...] worker is putting a hot plate of metal back in the oven with a pair of tongs in order to heat it up again. The gentleman beside him is a helper" [...] as showing need for achievement (n Ach) because the reheating "[...] to move ahead to the ultimate goal" (Atkinson, 1958, p. 722). Detailed scoring manuals have been developed (McClelland et al., 1958; Atkinson, 1958, pp. 685-735).
A second projective measure for the same purpose is the French Test of Insight (Atkinson, 1958, pp. 242-248). A brief description of behavior is given, e.g., "Bill always lets the 'other fellow' win"; the subject is to provide an explanation. The test consists of twenty such items, ten in each form. The score is the number of times desire for achievement is mentioned as a motive.

There is evidence that such projective measures are getting at a different aspect of personality than that shown in other measures. De Charms et al. (McClelland, 1955, pp. 414 f.) made up a questionnaire on desire for achievement (called v Ach). This correlated only .23 with n Ach. The subject with high v Ach is concerned with conformity, is deferential to expert authority, and disapproves unsuccessful people. High n Ach, on the other hand, is more associated with striving and effectiveness. Scores on the French test correlated near zero with peer judgments of motivation to achieve (Atkinson, 1958, p. 247). French was able to show that the peer judgments of motivation depended heavily on observed success and thus are probably reflections of ability rather than motivation.
Even though it is unrelated to observations and self-report, the projective test is related to behavior. High n Ach is generally associated with striving and [...] ([...] et al., in McClelland, 1955, p. 421). French and Thomas [...] divided a selected group of highly intelligent subjects into those [...] the Insight test and required them to solve a difficult intellectual problem ([...] described in Chapter 17). The problem had several acceptable solutions. The high n Ach group worked, on the average, twice as long as the others before giving up, and were much more successful in arriving at at least one [...]. [...] highly motivated [...] performance correlated .36 with ability, but the correlation was [...] in the low n Ach group. Ability predicts only when men are motivated to use that ability.

It is hypothesized by McClelland that thematic tests [...] motives at a given moment, as well as the average need level of the individual. This is supported by the finding that n Ach scores are [...] tension is raised by ego-involving directions (French, 1955), as well as by studies with focused tests of sexual, affiliative, and hunger drives.


Prominent Projective Techniques


During the decade 1945-1955 there was a wave of [...] enthusiasm for developing new projective techniques. Dozens of approaches were tried, but in most cases the research was so superficial that the [...] of each procedure, if any, was not established. Only a few of the techniques have survived, and some of them retain popularity only because [...] specialists keeps the test prominent. The following projective tests, in addition to those previously described, are encountered with greatest frequency in the current research literature or in clinical practice.

• The Blacky Pictures; Gerald S. Blum; Psychological Corporation, 1950. A set of cartoons involving a small d[...] stories revealing sexual [...] types of conflict deri[...] anxiety). Validation is [...].

• Children's Apperception Test [...] (See Bellak, 1954.)

• [...] M. Nijhoff, The [...] showing two [...] solitary [...] woven around all four [...] as might be [...] user, the test should have values similar to those of the TAT. (See H. H. Anderson and Gladys L. Anderson, 1951, pp. 149-[...].)

• [...] This is one of several tests in which the subject executes a drawing. Different interpreters emphasize different aspects of the production. Although claims are made for successful [...] in clinical cases, careful validation studies cast doubt upon the specific interpretative principles offered (Anastasi and Foley, 1952; Fisher and Fisher, 1950). There is no doubt that drawings reflect personality, but there is great uncertainty as to how to make sound inferences from them.

• Make-a-Picture Story Test (MAPS); Edwin S. Shneidman; Psychological Corporation, 1947. A variant of TAT in which the subject assembles paper cutout figures against backgrounds to make his own pictures. Presumably similar to TAT in value though even less structured. (Shneidman, 1951.)

• Rosenzweig Picture Frustration (PF) Test; Saul Rosenzweig, 1944, 1948. A set of cartoons in which one figure thwarts another; the subject, taking the part of the second person, is to tell how he would reply. The PF is therefore a self-report test using focused fantasies. The test is objectively scored. Though it is of definite research value, there is no clear theory for interpreting individual scores.

• Sentence Completion Technique; J. B. Rotter; Psychological Corporation, 1950. Another version, Amanda R. Rohde and Gertrude Hildreth; Psychological Corporation, 1947. The sentence-completion method is one of the simplest methods of obtaining information on conflicts either for screening of disturbed persons or as a preliminary to interview. Unfinished sentences such as "My mother . . ." or "When I make a mistake . . ." are to be completed by the subject. Several versions of the test have been crudely standardized. Although responses can be consciously controlled, the cooperative subject generally gives a useful picture of some of his salient attitudes (Rotter et al., 1949).

• The Szondi Test; L. Szondi, 1937, 1951. (See Deri, 1949.) Photographs of patients having various diagnoses are presented to the subject, who indicates which ones he prefers. It is assumed that even though the patient does not know the diagnoses, his unconscious tendencies to approach one type or another reflect his personal needs. Available evidence indicates that the Szondi-Deri hypotheses are invalid (Lubin in Buros, 1958, pp. 255-256).


Some summary may be attempted regarding the value of performance tests and projective tests when used as psychometric instruments, even though their variety requires that generalization be cautious. We have seen that tests vary greatly in their degree of focus. Some, such as the flicker-fusion measure and the box-of-coins test, sample behavior of an exceedingly specific type. Composite scores such as MacArthur's general persistence score or Goldner's whole-part score derived from several techniques cover a broader range of behavior. Likewise, the focused measures of n Ach and the Leaderless Group Discussion procedure give reliable scores which have appreciable correlations with [...]. When focus is almost completely removed, as in the TAT and Rorschach, it becomes much harder to measure any one variable accurately. In the present state of testing, therefore, these conclusions seem justified:

• Highly structured tests of narrowly defined variables have little usefulness save in development of [...] theory. Such tests are unlikely to have ultimate practical value except in a composite or battery, or in some rare situation comparable to the use of a color-vision test as a measure of occupational aptitude, where a specific test factor duplicates a specific task requirement.

• Less narrowly focused tests have potential value as measures of personality attributes. The one procedure now known to have practical value is the LGD, which is a worksample. With composite performance scores like MacArthur's or focused tests like McClelland's, significant traits can be measured. These trait measures are of potential value as measures of dependent and independent variables in research. Because numerous traits interact to determine behavior in any situation, any one such measure will rarely have a large correlation with any practical criterion.

• Unfocused projective techniques are poorly adapted for quantitative measurement, although scores can be derived from them. Their chief function is in impressionistic assessment of individuals, as part of a thorough case study employing numerous other sources of information. The next chapter will discuss this.


Suggested Readings 


65- 

E Bernard M. The leaderless group discussion. Psychol. Bull., 1954, 51, 4 
2 “ 
This is a comprehensive of LG 


account of evidence on the practical validity 


analysis of the personali and ability facto 
lead to good LGD performance. i i 


ven- 
m-solving situations. Life and ways of the 5€ 
» 1952. Pp. 298-344, olds, 


The 
for 

E P results 
and method of scoring and summarizes igators 
jven: 


rs which 


à om the erformance observations can 
with other data, p 


Burdock, Eugene I., Sutton, Samuel, & Zubin, Joseph. Personality and psychopathology. J. abnorm. soc. Psychol., 1958, 56, 18-30.
Research is described involving nineteen [...] tests of traits [...] from the physiological to the conceptual. Preliminary results are given [...] the possible diagnostic and theoretical significance of each test, together with critical comments on performance measures previously used in [...].

Levitt, Eugene E. The water-jar Einstellung test as a measure of rigidity. Psychol. Bull., 1956, 53, 347-370.
Levitt [...] of the Water Jar test raises questions applicable to nearly all performance tests.

McClelland, David C. Performance traits. Personality. New York: William Sloane, 1951. Pp. 162-199.
A penetrating discussion of the use of performance tests and ratings to isolate useful dimensions for describing typical behavior includes an excellent survey of factor-analytic studies of personality.


19

Assessment of Personality Dynamics


THE psychometric tradition isolates separate dimensions of ability and personality and represents the individual by assigning scores on these dimensions. Other testers follow a more artistic tradition in which tests are seen as just one procedure for gaining insight into a [...] system of needs, concepts, and perceptual attitudes. The impressionistic tester is not primarily concerned to arrive at quantitative scores on [...] dimensions. He is concerned with the organization of processes within the individual which give unity to his behavior. The impressionistic interpreter asks what "personality structure" (intrapersonal organization) could account for the observed facts: for the ways he perceives significant others in his life, for the discrepancies between his abilities on various [...], for the seeming differences between his fantasy needs and his overt behavior, etc. Such a coherent picture is of great potential value in making decisions about a case, though its usefulness depends entirely upon its quality and completeness, and these in turn depend on the range of information available and the astuteness of the clinician's synthesis of it. The clinician's judgment makes full use of his psychological theory and his experience with diverse cases, but his final portrait of the case is an artistic reconciliation of impressions.


Almost any psychological test, observation, or interview can be used as a basis for such a portrait. We have illustrated this with many examples: the description of John Sanders from the Stanford-Binet (p. 187); the description of a type of underachiever (p. 456); Grayson's description of a patient from the MMPI (p. 491); Osgood's and Luria's description of Eve from the Semantic Differential (p. 508); Henry's three partial TAT interpretations (p. 570). These examples vary considerably in the extent to which the interpreter speculates beyond the observed facts. Henry's [...] sketches range from an almost literal description of the young girl's [...] methods to a symbolic translation of the story given by the [...]. As one further example which shows how freely a clinician employs [...] imagination, psychoanalytic theory, and even frank speculation, we may quote a Bender interpretation given by Max Hutt (Shneidman, 1951, pp. 227-233) for the same patient whose MMPI Grayson interpreted.

Hutt first notes that the reproduced figures are arranged in the same sequence as in the original, the first six being aligned with the left margin, but that the last two drawings are fitted into the right half of the page.

Our first hunches then are: this individual has strong orderly, i.e., compulsive needs, tending towards a sort of compulsive ritual, but tries to deny them [the first drawing being displaced away from the margin, and the examiner having noted that the man draws fast and unhesitatingly] . . .; and he is oppressed with some (probably) generalized feelings of anxiety and (more specifically) personal inadequacy (clings to the left margin and is "constrained" to use all of the space available to him on this one sheet). We raise the question for consideration, at once, "How strong and from what source is this anxiety and what is his defense?" We can speculate, from his use of space, that he attempts in some way to "bind" his anxiety, i.e., he cannot tolerate it for long or in large amounts, and that one of the features of this young adult's functioning is the need of control. . . . The super-ego is very strict.


Here Hutt has looked at the style of the man's performance, and then has tried to infer what inner tensions and defenses could generate such a style. As he says, these are hunches and speculations to be checked against other evidence in the protocol and against all other information about the patient. As in most "dynamic" interpretations, Hutt uses Freudian concepts of drives, conflict, and defenses.

As an example of more detailed analysis, consider Hutt's remarks on the subject's reproduction of the rows of circles (see Figure 98).


Vari; bi the ten diagonal columns of circles [offer] further evidence of the marked 
ability which begins to appear to be characteristic of this ^S." The examiner 


Notes “ $ 
Fee Checks number of rows (i.e. CO. thirds through. We note 
eda angles of the columns of dots differ, becoming more obtuse (from the ver- 

figure is exaggerated in the 


tica ; 

iri a correction towards the end. 
culty in Eus n pe gei 2 Th ientation of the first column is 
Correct in establishing such relationships. e ori 3 

gets ii so the variation in “angulation is not ee ie 

ave € number of columns correct, but varies H ^ (a ea : 

tempt. imei then, for the presence © papi a [5 sw a 
the need mm "A Sa! ee a variability of performance? 

Without giving further details of Hutt's reasoning about perceptual style, we quote a few of his conclusions for comparison with the MMPI report and [...] of the therapist (p. 491). All the following descriptions are [...], in a context which explains how they were derived and with what degree of confidence: compulsive defenses not effective . . . acting out . . . regressive impulsivity breaks through . . . possibility of psychotic episodes . . . depressive reaction.

The clinician's interpretation of scraps and shreds of evidence is daring. Can one really know a person from the minor irregularities of his [...] dots? But if Hutt's interpretations of style seem bold, there are more startling things to come. From statements about defenses and controls, Hutt turns to the symbolism of the figures, following certain Freudian ideas about "masculine" and "feminine" designs:

FIG. 98. Bender reproductions by a young male mental patient (Shneidman, 1951, p. 228).

[In the figure composed of [...]] "S" has increased [...] the vertical sides [...] of the open square figure [...] S's reaction to authority figures can [...]


[...] "acting out" hypothesis, the former is more likely. [...] curved portion of this figure is enlarged, [...] out in the middle and reveals an impulsive flourish at the upper end. Now we may speculate that S's major identification is with a female figure, but she is perceived as more masculine (dominant, aggressive) than feminine and is reacted to openly with antagonism. [...] interesting that the upper portion of the curved figure extends well above its position on the stimulus card, and is at least as high up as the vertical lines. Here we may conjecture that S's mother (or surrogate) was stronger psychologically than his father, or at least seemed so to S, and that S would like to use his mother (or women) to defy his father (or men).


Such test interpretation is often severely criticized as unscientific. In defense of the method, we may note that Hutt is able to give a detailed rationale for each of his inferences; he is by no means allowing his fantasy free rein. A much stronger defense is that his description of the patient agrees well with the clinical picture given both from the MMPI self-report and in the therapist's notes. Word for word, we find confirmation there of ineffectual defense through obsession, hostile impulses the subject fears to express, and so on. Other remarks of the therapist support some of Hutt's most hazardous-seeming guesses: "The father seems to be a hazy person in the patient's life." "He talked of wishing to strangle his mother." "Had difficulty with authority figures." While the Bender analysis was not a perfect description, it yielded much better information about the depths of personality than one might expect from seemingly wild interpretations of a little task suitable for a child's drawing exercise.

The fact that clinicians continually have such striking successes with individuals gives them considerable right to feel confident in their methods and theory. At the same time, the clinical tests rarely satisfy the demand for systematic validation. If it was difficult to nail down the [...] of the [...] score on MMPI, it is impossible to put into [...] such innumerable hypotheses as that enlargement of the Bender wave form indicates a particular attitude toward one's mother.

The dynamic interpretation of a [...] by means of a single complex test or a whole assortment of techniques is frequently called "assessment," to distinguish it from psychometric measurement. [...] commonly takes [...] forms. The first is clinical analysis, well illustrated in the Grayson and Hutt interpretations. The second is prediction of performance of normal or superior persons assigned to responsible jobs. [...] testing of the 1930's, whose team-performance [...]. In the hands of German testers, tests were [...] primarily as samples of character traits such as will power and rigidity. [...] adapted in Great Britain for War Office Selection Boards. When conditions made it necessary to select officers from the ranks instead of relying on professionals from the upper classes, these boards took responsibility for judging ability and character of applicants for commissions. In the United States, Professor Henry Murray and his associates described the application of a large number of assessment techniques to [...] in the ground-breaking Explorations in Personality (1938). During World War II, Murray was asked to select staff for the Office of Strategic Services, the forerunner of today's Central Intelligence Agency (OSS Assessment Staff, 1948). In that program, use was made of group [...], team-performance tests, stress interviews, observation at meals and [...], peer ratings, projective techniques, and structured tests of [...]. The range of the testing program, its claim to penetrate hidden [...], and its cloak-and-dagger mystery have provided vivid material for [...] writers; see A. P. Herbert's Number Nine (1952), an attack on [...] servant selection, and Morgan's The OSS and I (1957). In recent years assessment methods have been used for selecting officers and executives and in numerous research programs.

The principal features of assessment procedures are: use of a variety of techniques, primary reliance on observations in unstructured situations, and integration of information by experienced psychologists. No program refuses to employ intelligence test scores or other [...] facts, but the emphasis remains upon synthesizing these data [...] rather than upon combining separate scores in a statistical formula.

Our chief concern in this chapter is to evaluate impressionistic assessment. There have been many validity studies, some penetrating and some superficial. We shall review the best of these.


VALIDATION STUDIES 


Attempts to Predict Job Performance 


The original assessment programs in the military and intelligence services were never adequately validated, largely because candidates were scattered to far places and to diverse duties, so that criterion data were lacking. Perhaps the most meaningful figure is the report from the British [...] that ratings of 500 officers in combat by their noncommissioned subordinates correlated .35 with Selection Board ratings. (This validity is corrected to apply to the entire range of candidates processed rather than the restricted group recommended for commissions.) These British studies also found that when a group of candidates was assessed separately by two boards, the correlation between ratings was only .67. Reliabilities of .80 were demonstrated, however, by teams trained to use similar standards and procedures.
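
The parenthetical remark about correcting the British validity for the restricted group can be made concrete with a short sketch. The source does not say which correction formula was applied or what the spread of Selection Board ratings was in each group, so the code below is only an illustration: it assumes the standard correction for direct selection on the predictor (often called Thorndike's Case II), and the observed correlation and the ratio of standard deviations are hypothetical values chosen for the example.

```python
import math


def correct_for_range_restriction(r_restricted: float, sd_ratio: float) -> float:
    """Estimate the validity in the unrestricted group (Thorndike Case II).

    r_restricted: correlation observed in the selected (restricted) group.
    sd_ratio: predictor SD in the full applicant group divided by its SD in
              the selected group (greater than 1 when selection narrowed
              the range of predictor scores).
    """
    if not -1.0 < r_restricted < 1.0:
        raise ValueError("r_restricted must lie strictly between -1 and 1")
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(
        1.0 - r_restricted**2 + (r_restricted**2) * sd_ratio**2
    )
    return numerator / denominator


# Hypothetical inputs for illustration only; these are NOT figures
# taken from the British report.
observed_r = 0.25
assumed_sd_ratio = 1.5
corrected_r = correct_for_range_restriction(observed_r, assumed_sd_ratio)
print(f"observed r = {observed_r:.2f}, corrected r = {corrected_r:.2f}")
```

When the ratio of standard deviations is 1 the function returns the observed correlation unchanged; the more severely the accepted group was restricted, the larger the upward correction.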


Assessment of British Civil Servants. Three-day "house-party" assessments of candidates for the British civil service were validated by Vernon (1950), who collected follow-up data on the men accepted. Though measurement of individual differences among such superior university graduates is extremely difficult, the validity coefficients were encouraging. Grades in a training course were predicted by final assessment rating with a validity of .82. For job-performance criteria the validities were .50-.65.

Two hundred administrators were rated by superiors after a two-year probationary period. Vernon gives fifty correlations of predictors with this criterion, and the predictors fall into two distinct groups. There are 27 validities for written ability tests; every coefficient is below .30, and the median is about .12. There are 19 correlations for ratings made by observers after performance tests or interviews; these correlations range from .26 to .49, the median being .41. Peer ratings had validities near .95 for this group. Evidently the impressionistic procedure identified aptitudes the pencil-and-paper tests did not.

It is important to note that in this successful assessment study selectors had "a clear and agreed conception of what they were selecting for, based on a thorough job analysis." The performance tests were for the most part job replicas of civil service paper work, committee tasks, and group discussions. Little use was made of personality theory. Projective tests, field observations, and stress interviews were absent or were given minimal attention [...].

A Study of Clinical Psychologists. When the Veterans Administration began to support training of clinical psychologists in 1947, it sponsored a study under E. Lowell Kelly. Kelly's team, which included prominent clinical psychologists and experienced OSS assessors, applied "every prom[...] procedure: objective, projective, subjective, clinical and quantitative" to a group of 187 graduate students in a nine-day assessment program. [...] Criterion [...] collected from universities in 1950 [...] on the trainee's ability as a therapist, as a student, as a diagnostician, and [...]. During assessment, ratings had been made [...] several surface habits (e.g., [...] to coöperate); underlying [...] (e.g., characteristic intensity of inner emotional tension); and [...] a psychologist [...] specific roles (e.g., group psychotherapy) [...] by persons who knew only [...] data, by [...] interview, and by [...] who had [...] of data; this procedure permitted compa[...]. Only a partial account of [...] of correlations [...].

Table 70 indicates how well single test scores or ratings based on them predicted certain important criterion ratings. The correlations, though frequently better than chance, are much too small for the predictors singly to be of substantial value in selection or guidance. The general ability test was much better than other methods for predicting academic performance, and also best for predicting rated clinical [...]. [...] measure is peer ratings; these had some validity [...] of diagnosis and therapy. The Bender-Gestalt did remarkably


TABLE 70. Selected Validity Coefficients for Single Predictors of Competence in Clinical Psychology

                                         Criterion Ratings
                              Academic   Therapy   Diagnostic   Clinical Competence


24 35 
Miller Analogies (verbal ability) 47 .02 . " 
Guilford self-report questionnaires "m 23 d» 
S (social extraversion) 06 p 7 (19) 
T (thinking extraversion) .05 tro (23) k 
Highest of 13 r's (.14) . 
MMPI i pata 
ighest of 9 r's, = 
id= ideal (26) (=16) — (—12 25 
Gough Psychologist key 16 15 " 
Strong VIB 04 EU 
Group | (creative-scientific) 26 95 23 2 
Kriedt Clinical Psychologist key 10 E 133 15 
Ratings from Bender-Gestalt als aya 24 7A 
Ratings from TAT .08 . S 02 00 
Ratings from performance tests (pooled) .19 J 05 "25 
Self rating .25 —.20 23 r 
Peer ratings 13 .28 


Boldface correlations are statistically significant.
SOURCE: Kelly and Fiske, 1951, pp. 146 ff.

of the criteria. Validities are lowered by the unreliability of criteria and the fact that the trainees had already been screened to eliminate obviously unsuitable candidates.

Table 71 shows the overall validity for impressionistic assessments [...] combining all sources of data. The correlations of .46 for academic [...] and .38 for clinical competence compare very favorably with the best
TABLE 71. Validity of Ratings of Clinical Psychologists Based on Comprehensive Assessment

                                         Criterion Rating
Assessment Rating             Academic   Therapy   Diagnosis   Clinical Competence
Academic .46 -09 2 45 
Therapy .24 -24 kid 32 
Diagnosis -36 14 a 
Overall suitability for 22 m 
clinical Psychology 27 48 E 
SOURCE: Kelly and Fiske, 1951, p. 161.

could be expected from a statistical combination of scores and ratings. The coefficients for diagnosis and therapy are lower, as in Table 70. This is partly to be explained simply by the inadequacy of those criteria. Probably an even more pertinent explanation is that the assessors, some of whom had no experience in analyzing personalities of graduate students and none of whom had studied the personality required in such roles as psychotherapist, were unable to make competent ratings on these two scales.

The third table indicates how much each part of the assessment program added to the final judgment. (Since this final judgment is based on only three assessors rather than the whole staff, the figures cannot be matched with those in Table 71.) Apparently, assessors did just as well when they had only the credentials file and objective test scores as they did with the addition of interviews and performance tests. This information, considered along with the modest validity coefficients for the performance tests taken alone (Table 70), does not encourage faith in the performance observations, at least in the absence of psychological job analysis.

TABLE 72. Evidence on Change of Validity with Added Information

                                                         Criterion Rating
                                                      Academic   Clinical Competence
Credentials file plus objective tests (one rater)        .36         .37
Above plus autobiography, projectives (one rater)        .38         .40
Above plus interview (one rater)                         .32         .37
All above information (conference of three raters)       .32         .42
Above plus performance tests (one rater)                 .31         .39
Final pooled judgment of three raters                    .33         .37

SOURCE: Kelly and Fiske, 1951, pp. 168-169.
Menninger School of Psychiatry Study. A similar problem was investigated by Holt and Luborsky (1958) at the Menninger School of Psychiatry. Applicants were interviewed and evaluated by structured and projective tests. One principal criterion was the competence of the accepted man as judged during the residency which completed his training.

The original assessment employed the usual practice of estimating success on the basis of the assessor's judgment. This judgment is based on psychological or psychiatric theory, which is presumably able to evaluate how well a person functions in challenging situations. The results were in general unsatisfactory. The average validity for the combined information from tests was .27, and that for interview assessments was .24. Even allowing for the fact that the correlations are based only on the restricted group who were


accepted and finished training, these validities indicate a high rate of error. Holt and Luborsky (1958, II, p. 139) examined whether some interviewers or test interpreters were markedly better than others, and though they did find differences, they conclude that the evidence "throws cold water on a frequently encountered suggestion for improving selection methods: find the interviewer who does the best job and have him teach the others how he does it. Even if one entertained the dubious assumption that an interviewer knows how he does it, and is able to teach the helpful rather than the erroneous parts of his technique, there is still too little difference between the predictive performance of the best interviewer and those of the others to make such an endeavor worth while."


inventory, and an in 
ber of correct classifi 
chance alone. Even 
was 56 percent, co 


hool we 
at recent graduates of Officer Candidate Sch " 
successful] in assessing 


lected applicants wer 
There were four asse 


collected included ratings by the officers and other squad members, and self-report scores. Pass-fail records in OCS were the criterion.

For all men combined, average ratings by the assessors had a validity of .55, and peer ratings of .58. Ratings based on specific performance tests generally had validities between .25 and .50. Self-report tests had essentially zero correlations with success. It was recommended that a five-day version of the procedure be used for selection whenever a reasonably large supply of qualified applicants is available. Although the assessment is time consuming, there is evidence that the procedure provides sufficient training and orientation to increase the man's chance of success in OCS, and therefore is economical.
IPAR Study of Air Force Officers. The most extensive research on assessment methods is the program of the Institute for Personality Assessment and Research at the University of California. The Institute was organized by D. W. MacKinnon, one of the original OSS staff, for the purpose of testing and improving assessment procedures, particularly as applied to superior men. Studies have been conducted on student, military, and professional groups, but the only major report is of an assessment of Air Force captains (MacKinnon et al., 1958; Gough and Krauss, 1958; Barron et al., 1958; Gough, 1958; MacKinnon, 1958; Woodworth and MacKinnon, 1958).

This is an exceptionally good test of what assessment can and cannot do. An expert staff was assembled, and an enormous range of procedures was applied to a large sample of men for whom several appropriate criteria were later available. The staff had a reasonable understanding of the criterion task, having previously carried out several studies of military personnel. Pencil-and-paper tests (ability measures, personality and interest questionnaires, biographical data) were taken by 343 captains eligible for promotion. Of these, 100 officers were brought to living-in assessment in groups of 10. For three days, they lived together with the psychologists, being interviewed, having a medical examination, taking projective tests, objective tests of perceptual performance, and group [...], and being evaluated by the staff in informal contacts. [...] scores [...] pencil-and-paper [...] and 398 scores or ratings from living-in [...]. These scores were compared with nine major criteria [...] for various subgroups of subjects; more than 15,000 validity coefficients were calculated.

Such a dragnet search for correlates of officer effectiveness is certain to score just by chance; 5 percent of the variables will show "significant" correlations with any criterion, and it is always possible to invent a plausible explanation for any correlation. The Institute staff, however, guarded against serious misinterpretation by dividing the sample and confining interpretation to results which appeared in all subsamples. A more serious difficulty is the dubious validity of the criteria. Independent criteria of officer effectiveness correlated in the neighborhood of .30 (Barron et al., 1958, p. 5). Consequently, test validities cannot be expected to rise much beyond this level. Even the most valid [...] cannot predict unstable criteria. The probable [...] among various criteria are the restricted range of ability in the group studied, and the difference in standards of judgment employed by various superiors.
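
The arithmetic behind the warning about a dragnet search, and behind the safeguard of requiring a result to reappear in every subsample, can be sketched briefly. This is only an illustration under simplifying assumptions that are not in the source: every predictor is taken to be truly unrelated to the criterion, the significance tests are treated as independent, the .05 level is used throughout, and the number of subsamples (three) is assumed for the example.

```python
def expected_chance_hits(n_coefficients: int, alpha: float = 0.05) -> float:
    """Expected count of 'significant' coefficients when no true relation exists."""
    return n_coefficients * alpha


def chance_of_replicating(alpha: float, n_subsamples: int) -> float:
    """Probability that one null predictor reaches significance in every
    one of n independent subsamples."""
    return alpha ** n_subsamples


n_coefficients = 15_000  # number of validity coefficients reported in the text
alpha = 0.05             # conventional significance level
n_subsamples = 3         # assumed number of subsamples, for illustration

hits = expected_chance_hits(n_coefficients, alpha)
p_all = chance_of_replicating(alpha, n_subsamples)

print(f"expected chance 'hits' among {n_coefficients} coefficients: {hits:.0f}")
print(f"chance that a null predictor is significant in all "
      f"{n_subsamples} subsamples: {p_all:.6f}")
```

Under these assumptions several hundred coefficients would reach significance by chance alone, whereas a null predictor would pass the replication requirement only about once in several thousand tries, which is why confining interpretation to replicated results is an effective guard.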
The staff ratings in which a global assessment was given ("overall military effectiveness") did not correlate beyond [...] with any of the criteria of effectiveness. All correlations were below [...] (MacKinnon, 1958, p. 36). From the psychometric field testing, a [...] "good-officer index" was computed on the basis of a formula developed [...] research on officers. This composite correlated no higher than [...] with the Air Force criteria (MacKinnon, p. 28). Three "clusters" [...] the assessment ratings had "disappointing" correlations with criteria of [...] (median .13). But although impressionistic ratings were unsuccessful as predictors of effectiveness, a reanalysis provided some encouragement. When flying officers were separated from ground officers, the [...] and -.02 respectively. That is, the assessors did reasonably well in sizing up flying officers, considering the instability of the criteria, and had no success at all in evaluating ground officers (Woodworth and MacKinnon, 1958, pp. 11-13).

Whereas effectiveness was hard to predict, a criterion rating on interpersonal relations (uncorrelated with the criterion of effectiveness as an officer) was successfully predicted. Several assessment [...] the range .20-.30, which is about as good as the criterion permits. No test score had appreciable validity for this criterion; the valid [...] came from staff appraisals. In general, the person seen by the staff as [...], conforming, and relaxed was rated by his superior as having good relations with others (Barron et al., 1958, p. 24).

The validity of psychometric measures was evaluated for the total officer group, but not for flying and ground officers separately. The results vary slightly from one criterion to the next, but no measure gave consistent evidence of validity. In fact, out of 194 test variables not a single one correlates significantly with the officer effectiveness rating in three [...] subsamples (Barron et al., 1958, p. 15). A few scattered correlations in the neighborhood of .30 indicate that there is promise of predictive value in empirical keys based on successful performance (e.g., a CPI key based on responses of high achievers) and in self-ratings on adjustment (Gough, 1958).
Rorschach and TAT were no 


j jnnov, al 

t useful in assessing officers Maskin B Job? 
Summary. The foregoing studies include the major repa S a5 
predictions to date. The most favorable results were obtained in 


;Mtes in 
nt ratings had validities 
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ment, followed by British civil service selection. The VA psychologist study 
for psychometric prediction and for assess- 
ment based on those tests plus a credentials file, with interviews and per- 
formance tests evidently adding little. The initial Menninger study of psy- 
chiatrists (their later study remains to be discussed) was less successful, and 
the classification of emotional failures among pilots by projective tests 
Showed zero validity. Common elements in the more successful procedures 
may be noted: 
• There is no evidence that psychological training gives the assessor an advantage. The best results occurred when officer candidates were rated by recent OCS graduates. The worst, as it happens, were obtained where the assessors were expert clinicians who, however, lacked specific experience with the types of candidates and criteria under study. The clinician's experience and theoretical background gives him confidence in the judgments he makes but seems not to make his judgments actually superior to those of the intelligent untrained observer who knows the job requirements. This conclusion is supported by the frequent finding that peer ratings have validities as good as those of ratings by observers.

• Structured tests, or performance tests which are very near to worksamples of the criterion task, have considerable validity. These tests are directly interpreted without use of intervening personality theory and can be used by nonprofessional judges. Tests requiring the judge to infer the subject's personality structure and then to predict behavior were rarely beneficial in these studies. Group performance tasks including LGD make an important contribution in predicting criteria where acceptance by one's group is necessary for success. They contribute much less to prediction when the criterion task calls for individual performance.

• The most important requirement for valid assessment is that the assessors have a clear understanding of the psychological requirements of the criterion task. The civil service and OCS assessors understood the ability requirements of the criterion task and made little effort at subtle psychological evaluation. The VA assessors and the Menninger assessors tried to match the candidates against their mental pictures of the successful psychologist or psychiatrist. The IPAR assessors assumed that the requirements for effectiveness were the same for ground officers as for flying officers. These stereotypes had never been checked by controlled observation of performers. If such an "obvious" relation as that between spatial ability and success in geometry is contrary to the facts, it is not surprising that stereotypes concerning therapists prove false.

It seems fair to conclude that impressionistic interpretations are often used where they have no validity. The assessor must learn to distrust even the most compelling hunch until it has been independently verified. In Kelly's apt phrase, too many psychological techniques are used on the basis of nothing more than "faith validity."
The generally black picture painted above of psychologists' most ambitious efforts at assessment is not, however, the final answer. There is a difficult problem of reconciling the statistical evidence with the [...] "clinical validity" of assessment techniques. We need to identify the sources of error in assessment and to arrive if possible at a statement of the conditions under which they are or can be made profitable.

1. When a group has been preselected before collection of validity data, correlations are reduced. In which of the assessment studies cited are the correlations based on groups more restricted than those which would usually be [...]?

2. In the OCS study, peer ratings were collected from nine men and [...] ratings from four judges. How is this fact relevant to the interpretation of validity coefficients of .58 and .55, respectively?

3. In which studies did the criterion depend substantially upon ability to make a good impression upon and win the coöperation of peers?


Sources of Error in Assessment 


In order to understand the difficulties of assessment, we need to examine the steps involved in information gathering and inference. Figure 99 compares three types of personnel evaluation: inference based on dynamic interpretation, direct impressionistic evaluation from behavior samples, and psychometric prediction.

Let us begin with the right-hand column, which outlines the stages leading to the criterion. Each box distinguishes one stage. Between stages, small type lists some of the sources of error which preclude a perfect correspondence among the findings at successive stages. Time intervenes between assessment and criterion performance; changes in personality during this period reduce the possibility of perfect assessment. Job performance depends not only on personality but on the specific conditions in the individual's job. Given a different superior or different assignments, his performance might change. The criterion 6d reflects performance only indirectly, being affected by the bias and incomplete observation of the supervisor. The sources of error in the right-hand column imply that even with perfect information about personality one could not predict job criteria perfectly.

The simplest assessment method is psychometric scoring of behavior and application of a "cookbook" formula to arrive at a prediction (center column of Figure 99). Reduction of behavior to scores discards some amount of information. The combining formula may introduce error if it was developed under conditions that do not apply perfectly to this new sample. Every step from 1a to 6c and from 1a to 6d involves an additional source of error. By this analysis, there are seven places where error can lower the correlation between prediction 6c and criterion 6d.

[Figure 99 shows numbered stages in boxes, with the sources of error entering between stages printed in small type.
Assessment stages: 1a. Personality at time of assessment; 2a. Behavior during assessment; 3a. Observer's perception; 3b. Test scores; 4a. Inferred personality structure; 5a. Predicted behavior on job; 6a. Predicted merit rating; 6b. Observer's favorable or unfavorable impression; 6c. Predicted merit rating.
Criterion stages: 1b. Personality on job; 2b. Behavior on job; 3c. Supervisor's perception; 6d. Criterion merit rating.
Sources of error named: sampling of behavior; motivation for test; bias and error of observation; selection of information; theory of personality; theory of task requirements; combining formula; assumptions about supervisor's values; bias or values of observer; intervening experience; relation to superior; specific responsibilities; values of supervisor.]

FIG. 99. Stages in assessment and in criterion development.

Impressionistic rating from observed behavior is seen in the OCS study, where value judgments were made directly, without dynamic analysis of personality. Here again there are seven possible sources of error between 6b and 6d. These may not be damaging to the correlation if, as in the OCS study, the bias of the raters resembles the bias of the supervisors.

Dynamic assessment, in the left-hand column, involves two added stages of inference. The step from 3a to 4a is hazardous because personality constructs are poorly developed and poorly matched to test behavior. The step from 4a to 5a involves equally undependable constructs about the requirements of the criterion task and about how personality differences affect job behavior (Cronbach, 1956). The added links of hazardous inference make dynamic prediction far more prone to error than the more conservative predictions 6b and 6c.

This diagram leads to several suggestions for validation research and for the improvement of assessment. If criterion 6d is affected by error, [...] a better criterion could be obtained. A worksample of job performance should correlate higher than the rating criterion with 5a and 3b. The psychologist
may quite accurately predict 
yet not be able to 
evaluate that initiativ 

Even more import 
ality structure (4a) w 
Luborsky was able, h 
not able to judge wheth 
type of claim is valid, t 
translating knowledge o 
low validity coeffici 
cate th 


ed person" 


at 
alled the 
p was 


nerally 
t indi- 


þe 
H ay 
" : , ; but it m 
ong the chain of inferences, 


y to 1b. 


; he ge 
f personality into expected behavior. The § 


en 
; ic assessme 
ents in studies attempting dynamic a 
at Something is wrong al 


that 4a corresponds excellent] 


Validity of Clinical Descriptions 


The validity of inferences about personality structure cannot be deduced from the assessment studies discussed above, where descriptive inferences were not recorded in a form suitable for verification, and where the criterion was an overall evaluation rather than a description. This evidence must come from studies in which inferences are compared with other descriptions of personality structure.

The literature contains
kae ear 
and Grayson cask descri tions are exam s : Es "t igen alos t 
clinical impressions ai r I d m = B F S : wall hé recalled that these 
CHE ciis. ier tn O; respone ed very w ell with the therapists case notes. 
Tien ips d À LN can point to CASES ware projective techniques gave 
ie "- se unique iium of individuals, features so rare that they 
ieee E ts aly be attributed to chance. George DeVos once made a blind 

analysis of the Rorschach record of a research worker and, in reporting on the inferred personality structure, commented, "This man ought to be an [...] completely happy down in Washington digging minute details [...] archives" (these being a set of century-old documents [...] just been opened to scholars). The man was a specialist [...] research is most uncommon, but he actually was in Washington at the time the analysis was made, extracting detailed information from fifty-year-old files of a Congressional committee! Such "hits" cannot be explained away, and constitute the most persuasive evidence of the [...] of projective methods.

The critical thinker must ask, however, just how often the descriptions are [...]; perhaps we hear about only the successful predictions. Formal testing of the validity of descriptions is difficult, and few adequate studies have been reported. One method, asking an informed judge to state whether the description fits the individual, is unsatisfactory. Judges tend to say that the description fits even when it was actually written about someone else. This is due partly to noncriticalness and partly to the tendency to write vague statements that might fit anyone, e.g., "prefers a certain amount of change and variety" (Sundberg, 1955; Davenport, 1952).


A second procedure uses matching. Descriptions may be prepared for, say, [...]. Judges who know all the individuals, or who have folders of case material on them, may be asked to match each description with the proper person. Judges have had a [...] chance level in studies of this character (e.g., Vernon, 1935; Henry, 1947; Palmer, 1951), but this alone is not an adequate validation method. A successful match may be made on one aspect of the description even if other parts of the description are incorrect, and sometimes there will be a mismatch because of one minor error in the portrait. A more specific technique which indicates the validity of particular predictions is required (Cronbach, 1948).

One method of considerable value when properly applied is Q sorting. Statements may be prepared covering dozens of aspects of behavior. These statements may be judged as fitting or not fitting the individual both by the assessor and by others well acquainted with his behavior. Various statistical methods may then be applied to investigate whether the two descriptions correspond. The most satisfactory procedure is to correlate all descriptions by the assessor with descriptions of each case by the [...]. If the clinical descriptions are discriminating rather than universal, the correlation between the two descriptions for the same person should be much higher than that for descriptions of different persons.
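
The comparison just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a procedure taken from any of the studies cited: each description is assumed to be a list of numerical placements of the same statements, the function names are hypothetical, and the data in the example are invented.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length lists of placements.
    (Undefined if either list has no variation.)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def matched_vs_mismatched(assessor, criterion):
    """Mean correlation for same-person pairs and for different-person pairs.

    assessor, criterion: dicts mapping a person's label to that person's
    Q-sort placements as given by the assessor and by the criterion judge.
    """
    matched, mismatched = [], []
    for person, a_sort in assessor.items():
        for other, c_sort in criterion.items():
            r = pearson(a_sort, c_sort)
            (matched if person == other else mismatched).append(r)
    return sum(matched) / len(matched), sum(mismatched) / len(mismatched)


if __name__ == "__main__":
    # Invented placements of five statements for two persons.
    assessor = {"A": [1, 2, 3, 4, 5], "B": [5, 3, 4, 2, 1]}
    criterion = {"A": [1, 3, 2, 4, 5], "B": [4, 5, 3, 1, 2]}
    print(matched_vs_mismatched(assessor, criterion))
```

If the assessor's descriptions are discriminating, the first value returned should be clearly higher than the second; if the descriptions would fit anyone, the two values will be about equal.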
No ideal study of this type is available. Samuels (1952) [...] trait-by-trait interpretations from the projective tests taken by [...] psychologists. He found that ratings on such traits as depressed-vs.-[...], [...] adjustment, and quality of intellectual accomplishments made from different tests (Rorschach, TAT, Bender, Sentence Completion) had a median correlation of only .05-.08. This in itself indicates that there is no agreement among projective interpreters. A comparable analysis of the TAT was made by Hartmann (1949), with records of 35 delinquent and dependent boys. Two raters given a complete case record judged each boy on 42 variables. These "criterion" judges had a median correlation of .44, which indicates how difficult it is to make trait judgments. When the TAT interpreter, W. E. Henry, rated each boy, his correlation with the criterion judges averaged .16. Hartmann also obtained a correlation indicating how well the description for each boy separately, over all 42 scales, agreed with the criterion evaluation. While the median correlation between the criterion judges was [...]89, Henry's median correlation with these judges was .25. The TAT description fell short particularly in judging aggressiveness, stability, attachment to father, school adjustment, activity in recreation, and moral standards. It did best on taciturnity, self-reliance, and maturity. In another study Henry (1947) compared TAT descriptions with other sources of data; there was essential agreement between TAT and at least one other source in 8[...] percent of the specific statements. Regarding only 2 percent was there definite disagreement. These are much better results than in the Hartmann study, but the comparison is less well controlled.

The foregoing series of findings implies that projective techniques as currently interpreted are not dependable sources of complex descriptions, though some reports are appreciably better than random [...]. Their value can be improved with further development of scoring reliability and interpretative theory. No evidence is available on the adequacy of personality descriptions from observations in complex performance tests or interviews.


IMPROVING THE USEFULNESS OF ASSESSMENT METHODS

Some critics of projective methods and of impressionistic assessment have concluded that these methods are indefensible, and that psychologists who depend on them are deluding themselves and those to whom they report. Confirmed believers in assessment methods, on the other hand, regard their critics as "methodological bluenoses" (Bellak, 1954) and even deny the relevance of such formal evidence as is available. This denial goes much too far; as R. R. Holt (1958), a proponent of clinical techniques, says, "If the issue were whether some clinicians have made themselves look foolish by claiming too much, then I should agree: these studies show that they have, and unhappily, brought discredit on clinical methods generally." Since the claims made in the past have frequently been discredited, any would-be assessor is responsible for presenting indisputable public evidence of the dependability of his judgment; vague claims regarding successful experience will not suffice.

On the other hand, the evidence does not demand abandonment of assessment methods. Personality testing has had a shorter history than ability testing. With personality as difficult to analyze as it is and with the available techniques all open to one serious objection or another, it is important to turn attention to how assessment techniques can be improved. It is equally necessary to understand just what function each procedure is best for. Many of the attacks on projective techniques and many of the defensive arguments have been based on a misconception of their proper role in the study of personality.


Improving Test Interpretation 


Projective tests and situational observations have been interpreted by means of whatever theory a particular interpreter adopts. Some TAT interpreters view the stories as samples of stable traits likely to be shown in overt behavior, some consider the test a measure of strength of motives which shift from occasion to occasion, and some consider the test as a measure of unconscious and unexpressed drives. If the interpretation of a test has not yet been stabilized and verified, no one knows what that instrument might do at its best. Recent years have shown many defects in the theories used to interpret complex tests, and personality theory itself is undergoing substantial change.

We cannot examine detailed questions about particular weaknesses in interpretation, but we can note studies having very general [...]. The first is the finding of serious bias in certain techniques. [...] (see also Samuels, 1952) asked judges to indicate how a subject had behaved, using a multiple-choice test based on incidents from [...]. One group of judges responded knowing only the subject's age, sex, and background. Other judges had Rorschach and TAT records. Answers were scaled from 4.0 (implying excellent adjustment) to 1.0 (implying maladjustment). The median of the responses representing what the man actually had done was 2.5. Judges relying on general background data gave a generally favorable assessment (median prediction of his response scaled at 3.1), but judges relying on projective data were unfavorable (median response 2.0). On the whole, the projective interpreters misjudged the subject whenever his actual behavior showed good adaptation. In uncertain situations, they expected the worst of him.

This apparent bias toward psychopathology in Rorschach interpretations 
is implied also in Roe’s work on eminent scientists, the Menninger psy- 
chiatrist study, and other investigations. Within any group of normal persons 
one encounters records which, considered by themselves, would seem to be 
indicative of gross emotional disturbance or psychopathology (Gallagher, 
1955). As one Menninger assessor commented, "The TAT usually exposed a man's weakest points, without giving compensatory signs of his strong ones. . . . To some extent the same thing is true of the Rorschach. . . . Latent
conflicts show up in these tests much more plainly than the compensating 
strengths; . . . [it is necessary] to be very cautious in assuming that such 
potential liabilities are actual if they are not seen operating in much more 
direct fashion” (Holt and Luborsky, 1958, p. 246). 

One reason for the bias is that projective theory was developed through 
the study of mental patients without appropriate control studies of normals. 
Only fairly recently have extensive data on normals and superior individuals 
been collected. A projective technique reveals drives and impulses, but it 
does not indicate clearly how they are controlled. Strong hostility is likely 
to be an unfavorable sign in a person tested by a clinic; in an executive, author, or school superintendent, the same force may be harnessed to creative and socially constructive activity. Until further research enables the tester to distinguish unchanneled from controlled forces, one must interpret damaging indications with caution. Ultimately, we may learn to identify constructive as well as disruptive forces through projective protocols.

[...] performance tests are not comprehensive cross sections of personality. On the contrary, they are observations in a specific situation, [...] to future situations only with considerable risk. [...] cold, forbidding female Rorschach [...] strongly [...] impulsive behavior from the same subjects who gave [...] passive, unimaginative responses when tested by a soft "mother-figure." Moreover, deliberate efforts on the part of the [...] to be more [...] altered the test performance. No test is a measure of personality in isolation; it is always a sample of social interaction with a specific other person (Schafer, 1954; Sarason, 1954). There is considerable risk when one generalizes to other social interactions.

"Blind analysis" [...] custom in validation studies aimed to determine what can be done with a single test, but practical interpretation should be based on considerable background knowledge. Indeed, Holt and Luborsky wisely recommend study of projective tests only along with intellectual tests and a case history. "If one could give only the Rorschach and TAT, it would be better to give no tests at all rather than spend time with so dubious a prospect of satisfactory results. Projective tests give valuable insights into personality, but the level of material from which they draw varies so much from one case to another and its significance is so dependent on a framework of realistic knowledge about the person that projective techniques can make their proper contribution only when used in conjunction with other methods" (p. 303).

The statement about "levels from which they draw" makes reference to one of the most confusing problems in interpretation of projective tests. Sometimes a story or free association appears genuinely to reflect deep-lying repressed conflicts, but one cannot be sure which records or parts of records have such hidden meanings. The interpreter will do well to heed Schafer's warning (1954, p. 150) against "arbitrary, presumptuous efforts to deepen interpretation in spite of the patient."
Still another difficulty which can be remedied only by improving personality theory is semantic confusion. If the test interpreter uses words which mean different things to different people, he cannot hope that his interpretations will be confirmed or that they will be practically beneficial. Many of the key words in dynamic interpretations are highly ambiguous. Grayson and Tolman (1950) asked psychologists and psychiatrists to define such words as bizarre and aggression. Twenty-three of these clinicians defined the aggressive person as hostile and destructive; but 21 of them used the word to describe positive, assertive, dominant behavior. As the authors said, "The most striking finding of the study is the looseness and ambiguity of many of these terms. . . . For the most part the lack of verbal precision seems to stem from theoretical confusion in the face of the complexity and logical inconsistency of psychological phenomena. Verbal discrepancies can only be [...] by a deeper understanding of these underlying phenomena which will require many years of careful, penetrating, and analytical psychological experience."

Psychological Study of Treatment Situations


No psychometric tester would willingly introduce a selection plan without first validating his tests against criterion information, but the assessor has generally made blind predictions. The assessor has picked Army officers, civil servants, espionage agents, and fliers with no better standard than his hunches about the demands of the situations involved. While one might excuse such presumptuousness on the grounds of wartime necessity, when men must be selected according to someone's best guess, prudence demands realistic job analysis and test tryout when circumstances permit.
Nearly all the validations of assessment methods have examined the merit 
of “naive clinical assessment,” which Holt (1958) describes as follows: 


The data used are primarily qualitative with no attempt at objectifi- 
cation; their processing is entirely a clinical and intuitive matter, and 
there is no prior study of the criterion or of the possible relation of the 
predictive data to it. Clinical judgment is at every step relied on not 
only as a way of integrating data to produce predictions, but also as an
alternative to acquaintance with the facts. 


The choice of sound psychometric methods and interpretations has, from the days of Wissler and Binet, depended upon thorough empirical follow-up; similar follow-up is even more essential in personality appraisal, where problems are more complex. Holt suggests the following as a "sophisticated" clinical method:


Qualitative data from such sources as interviews, life histories, and
projective techniques are used as well as objective test facts and scores,
but as much as possible of objectivity, organization, and scientific
method are introduced into the planning, the gathering of data, and
their analysis. All the refinements of design that the actuarial tradition
has furnished are employed, including job analysis, pilot studies, item
analysis, and successive cross-validations. Quantification and statistics
are used wherever helpful, but the clinician himself is retained as
one of the prime instruments, with an effort to make him as reliable and
valid a data-processor as possible; and he makes the final organization
of the data to yield a set of predictions tailored to each individual case.


This procedure was applied as well as possible in the second phase of the Menninger psychiatrist study. Naïve assessment, as described above, was applied with mediocre success, validity being .24. In the "sophisticated study" the investigators examined enough successful and unsuccessful [...] to [...] a concept of the good psychiatrist. Specific cues in the [...] other data were identified, to provide an objective framework for [...] cases. Despite this effort, the resulting scores or [...] developed in this manner had poor validity when based on a [...] test. Predictive ratings based on all data had validities of [...] for one judge, .22 for the second (average, .40). Holt and Luborsky conclude that the validities from the final [...] and recommend application of refined assessment methods to the selection of psychiatrists. [...] appears still to be in doubt, however. The one coefficient of [...] is high, but it gives no assurance that the validity of judges would be consistently superior to the naive predictions. Moreover, when Verbal IQ correlates .39 with the criterion, it is hard to believe that clinical judgments represent an improvement sufficient to justify the labor involved. Even with careful analysis of the personalities of men whose criterion scores are known, assessment must overcome serious difficulties.
Every test represents performance in a highly specific situation, as we [...]; there is a counterpart problem with respect to the criterion. [...] rating is generated by the specific interaction of a man and one [...]. The psychiatrist who might do well with children can be rated [...] if he proves unable to [...] with hospitalized adults early in his [...] training. An unfortunate first assignment or an incompatible first [...] may develop feelings of incompetence and a bad reputation [...] prevent the man from reaching his potential. Likewise, the success of the psychoanalysis the student of psychiatry often undergoes is a significant [...] unpredictable feature of the situation. Assessment is [...] used to [...] men for uniform, well-defined jobs. The usual problem in executive [...] or clinical evaluation is to judge how the individual will get [...] an ill-defined or unspecified situation. Where many variable conditions intervene between prediction and follow-up, high validity cannot be hoped for.
One might hope for test data to predict the average success of a man in [...] independent situations. A statistical composite such as a college grade average is a "convergent phenomenon" (Langmuir, 1943; L. K. Frank, 1948). Despite lucky or unlucky experience in a single course, the average becomes more and more stable and hence easier to predict as more courses are added. All-round popularity is another convergent phenomenon. A person's standing may vary greatly from [...] to office to bowling team, but as more groups are added his average [...]. A phenomenon is said to be divergent if the successive events that cause it to develop are highly interdependent. A landslide is an example. One stone jars another, the two moving together dislodge others, and soon an irresistible stream of debris is pouring downhill. This force is a sum of many separate movements, but not a sum of independent events. Rather, every added stone is an amplification of the original movement; if the first stone had not moved, there would have been no landslide.
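
The stabilizing of an average in the convergent case follows from elementary statistics, and can be sketched as below. The sketch assumes that the chance component in each situation is independent of the others and has the same standard deviation; the function name and figures are hypothetical. It does not apply to divergent phenomena, where the events are interdependent.

```python
from math import sqrt


def sd_of_average(sd_single, n_situations):
    """SD of the chance component in an average over n independent
    situations, each with chance SD sd_single (independence assumed)."""
    return sd_single / sqrt(n_situations)


if __name__ == "__main__":
    # Hypothetical chance SD of 1.0 grade point in any single course.
    for n in (1, 4, 16, 64):
        print(n, sd_of_average(1.0, n))
```

Quadrupling the number of independent situations halves the chance variation in the average, which is the sense in which a grade average or all-round popularity becomes easier to predict as more courses or groups are added.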

Prediction of divergent phenomena is not possible. Possibilities can be [...] ("that hill looks loose enough to slide"), but what will occur can [...] only on the average over independent situations ("Landslides [...] cost the state road department [...] of dollars"). The social scientist can predict accurately how many [...] a college class will marry. He can predict much less well whether a particular woman will marry. Whether she will marry the man [...] is extremely uncertain. Successes and failures do not average out; one [...] wrong time may end all chance for compensating pleasant [...]. Even with full day-to-day information, the fate of this possible marriage [...] unpredictable for many months.

The assessor can reasonably hope to figure out how many men of a certain type will succeed in psychiatry. He can perhaps judge what any individual would do on the average if he could have ten independent careers in psychiatry. But any one career is like a horse race; a delay in the start, a jam on the track, one mistake by his rider, and the favorite loses. The psychiatrist has just one career, in one group, under one set of conditions. If he establishes a good relation with the significant figures in this group, the beneficial consequences will rebound through his whole life. But that relation depends on chance events as much as on his stable personal qualities. As William James warned, psychology can establish general expectations but cannot hope to give biographies in advance.


4. Show that each of these is the result of a divergent phenomenon:

a. The ceremony was a moving emotional experience.

b. Terry is coöperative, but his brother Mike no one can manage.

c. Charles' interest in science is becoming focused on genetics.

5. How reasonable is it to try to predict each of the following?

a. Will men of this type respond better to close supervision or to freedom?

b. How will this man respond to close supervision?

c. Will Mark like selling?

d. Will Mark like this job as salesman?

6. Defend the statement: "After a certain point in its development, a divergent phenomenon becomes predictable." What sort of information is needed at this point for prediction? What does this imply for the psychologist?


The Unique Functions of Assessment Procedures 


In the writer's opinion, assessment techniques have been asked to do a job for which they are ill suited. It has been necessary to emphasize the extensive and discouraging negative results on the use of clinical judgments as predictors, but there is another, more positive evaluation to be made. Assessment techniques have three related features which set them apart from conventional psychometric methods. Stated simply, these are as follows:

They provide information both on typical response patterns and on stimulus meanings.

They cover a very large number of questions about the individual.

They provide information about different questions for different individuals.

Coverage of Stimulus Meanings. The psychometric approach is to confront the individual with a carefully selected task or set of tasks which represent a criterion situation in some way. This description applies to proficiency tests and aptitude tests, to questionnaires on typical performance, and to worksample performance tests such as the LGD. We saw that even impressionistic interpretation of such samples of behavior gave valid predictions for civil service and OCS selection. The essential assumption in this type of testing is that we can generalize from a sample of behavior to performance in one class of situations.
A person's behavior changes from situation to situation, however, and whenever one must understand the person as a whole, or must select situations to fit him, a simple prediction by sampling within one class is impossible. One must begin to learn what situations mean for him. Much of the content of an interview deals with situational meanings: attitudes toward parents, former employers, school subjects, etc. The thematic projective tests elicit similar information, though in a more disguised and perhaps less censored form.

The Semantic Differential is the only psychometric technique designed to study meanings the person gives to significant others. Even this procedure, though structured and quantifiable, is interpreted impressionistically when a single individual is under study. Hence there is no psychometric technique for obtaining information about the subject's reactions to various persons and situations, unless one wishes to prepare dozens of questionnaires or Q sorts, each dealing with one person or situation. While research along the lines recently opened by Osgood and Kelly may lead to well-controlled psychometric techniques, at this time there is no alternative to some type of clinical assessment if we want attitudinal information covering a wide range of objects. It is unfortunate that there has been no controlled validation to show just how well such procedures as TAT and Semantic Differential identify significant attitudes. Virtually all systematic validation of impressionistic methods has examined their adequacy as measures of traits (i.e., of response information).

7. What value does information about situational attitudes have for a decision in each of these situations?

a. Counseling a couple having marital difficulty.

b. Appraising junior executives.

c. Evaluating a student on probation.

8. Why is information about attitudes toward diverse situations more important in dealing with divergent phenomena than with convergent phenomena?

9. Evaluate the suggestion that situational attitudes might be assessed by administering dozens of scorable questionnaires, each dealing with a different attitude-object. To what extent would this overcome the limitations of impressionistic assessment?

10. Is the assessment of stimulus meanings really distinguishable from assessment of traits? (Example: Is "hunger drive" distinct from "attitude toward food"?)

11. List the persons or situations about which attitudinal information would be useful in clinical study of a bright 9-year-old child who is unable to read.


Bandwidth. Shannon's "information theory" (1949), developed for a study of electronic communication systems, provides a model for considering the second important feature of assessment methods. He distinguishes two attributes of any communication system: bandwidth and fidelity.

Home record players have made "high fidelity" familiar to everyone. The complementary concept of bandwidth refers to the amount or complexity of information one tries to obtain in a given space or time. The fidelity of recording depends upon the width of the groove; if grooves are crowded together to put more music on a record, fidelity suffers. Fidelity could be improved over present standards by designing record and playback equipment which would carry less information (e.g., a 33-rpm record lasting only ten minutes instead of thirty). With other things held constant, any shift in the direction of greater fidelity reduces bandwidth; and increase in bandwidth may be purchased at the price of fidelity. In any particular communication system there is an ideal compromise between bandwidth and fidelity. The record industry settled on the 33-rpm "long-play" record; the FCC allows the FM station a bandwidth of 22 kilocycles.

The classical psychometric ideal is the instrument with high fidelity and low bandwidth (Cronbach and Gleser, 1957; Hewer, 1955, pp. 3-19). A college aptitude test tries to answer just one question with great accuracy. It concentrates its content in a very narrow range, using correlated items to increase reliability. Because its parts are highly correlated, part scores give little information for choosing majors or diagnosing weaknesses. Most of the excellent predictors such as the LGD participation score and the peer rating have similar limitation to one central variable.
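The exchange of fidelity for bandwidth can be illustrated with a worked calculation. The sketch below is not taken from the sources cited; it assumes, purely for illustration, a fixed budget of 120 items, a hypothetical reliability of .10 for a single item, an even division of items among the questions to be answered, and the Spearman-Brown formula as a rough index of the fidelity of each resulting score. The function names are likewise hypothetical.

```python
def spearman_brown(item_reliability, n_items):
    """Reliability of a score built from n_items parallel items (Spearman-Brown formula)."""
    r = item_reliability
    return n_items * r / (1.0 + (n_items - 1) * r)


def fidelity_for_bandwidth(total_items, n_questions, item_reliability=0.10):
    """Reliability of each score when a fixed item budget is split evenly over n_questions."""
    items_per_question = total_items // n_questions
    return spearman_brown(item_reliability, items_per_question)


if __name__ == "__main__":
    # More questions answered (greater bandwidth) means fewer items, and less fidelity, per answer.
    for n_questions in (1, 3, 12, 40):
        print(n_questions, round(fidelity_for_bandwidth(120, n_questions), 2))
```

With these assumed figures, devoting all 120 items to one question gives a reliability of about .93, while spreading them over forty questions leaves about .25 for each score.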

At the opposite extreme, the interview and the projective technique have almost unlimited bandwidth. [illegible] the interviewer may cover twenty topics [illegible] larger number of traits. In some TAT studies [illegible] than forty variables, all on the basis of about [illegible] description adds a dozen or more statements [illegible] not commonly encountered. Many techniques have intermediate bandwidths, and a particular technique may be used as a narrowband method by some testers and as a wideband method by others. All the validity studies we have reviewed substantiate Shannon's principle: increases in complexity of information are obtained only by sacrificing fidelity. The Wechsler Verbal IQ is highly valid. Patterns of subtest scores are of some but quite limited value. And interpretations of responses to single items, or judgments about [illegible] processes, are distinctly untrustworthy. The most successful combinations of large bandwidth with relatively high fidelity are the GATB and the SVIB, both of which are designed for counseling where many alternatives must be considered and useful prediction can be made from about a dozen scores.

Extremely large bandwidth is disadvantageous because the information becomes too unreliable for use. Extremely small bandwidth, on the other hand, is appropriate only where there is one specific, all-important question to be answered, to which all testing effort should be devoted. While no rule can be given specifying the ideal bandwidth for testing, we can point to conditions favoring wider or narrower bandwidth:

The first is the number and relative importance of decisions to be made. If an institution is concerned with a simple decision and only one outcome, it should concentrate on the information most relevant to that decision. (Example: a college wishing to admit students who will make good academic records, without regard to values, social or emotional adjustment, or probable post-college career.) If many outcomes or alternatives are to be considered, more types of information are needed and bandwidth must increase. Counseling, diagnosis, remedial teaching, and supervision of professional workers generally involve multifaceted decisions. The testing effort should be balanced to obtain relatively dependable information on the most important questions or those which are most likely to arise. It is better to ignore many questions than to spread one's inquiry too thin (Cronbach and Gleser, 1957, p. 96).
Bandwidth can be greatly increased when it is possible to verify judgments at a later time. Lack of fidelity does not then lead to costly errors. Narrowband instruments are desired for making final, irreversible decisions about important matters (e.g., scholarship awards). A wideband technique, on the other hand, serves well as the first stage in a sequential measuring operation. As a first stage, the wideband procedure scans superficially a range of important variables, pointing out significant possibilities for further study. In this use the wideband procedure is used for hypothesis formation, not for final decisions.

Consider the Strong blank, for example. It is not a highly [illegible]. It is an inexpensive pencil-and-paper instrument which gives an excellent preliminary mapping of the [illegible]. Its ease of administration, objective scoring, and [illegible] make it superior to the unconstrained interview (which has even greater bandwidth). Following the test, the counselor uses a more focused interview to [illegible] high scores and to determine their implications. Even this discussion should not lead to a final decision. It is better to narrow the choice to two or three areas; these hypotheses can be tested by enrolling in suitable courses and by trying relevant summer jobs.
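The logic of using a wideband survey as the first stage of a sequential operation can be sketched in a simulation. Everything numerical here is an assumption made for illustration, not a finding: twelve areas of possible choice, a rough first-stage score with as much error as true variation, a focused follow-up with much less error, and a shortlist of three. The function names are hypothetical. The sketch compares how often the truly best area is chosen when the wideband score alone decides and when it only selects the areas to be studied further.

```python
import random


def choose_area(n_areas=12, wide_noise=1.0, narrow_noise=0.3, shortlist=3, rng=random):
    """Simulate one counseling case; return (wideband_only_hit, sequential_hit)."""
    true_fit = [rng.gauss(0.0, 1.0) for _ in range(n_areas)]
    best = max(range(n_areas), key=true_fit.__getitem__)

    # Stage 1: a wideband survey gives a rough score on every area.
    rough = [t + rng.gauss(0.0, wide_noise) for t in true_fit]
    ranked = sorted(range(n_areas), key=rough.__getitem__, reverse=True)
    wideband_choice = ranked[0]

    # Stage 2: only the shortlisted areas receive a focused, more accurate follow-up.
    focused = {i: true_fit[i] + rng.gauss(0.0, narrow_noise) for i in ranked[:shortlist]}
    sequential_choice = max(focused, key=focused.get)

    return wideband_choice == best, sequential_choice == best


def hit_rates(n_cases=5000, seed=1):
    """Proportion of cases in which each strategy picks the truly best area."""
    rng = random.Random(seed)
    results = [choose_area(rng=rng) for _ in range(n_cases)]
    wide = sum(w for w, _ in results) / n_cases
    seq = sum(s for _, s in results) / n_cases
    return wide, seq


if __name__ == "__main__":
    print(hit_rates())
```

Under these assumptions the sequential strategy should choose the best area more often than the wideband score taken as a final decision, which is the sense in which fallible first-stage information does little harm when it is treated as a hypothesis.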

Opportunities for follow-up and verification of assessments or score interpretations exist in virtually every decision [illegible]. Fallible tests can suggest assignments for an employee, [illegible] for a patient, or teaching techniques for a student. Even if the suggestion is little better than a guess, it has some value when there is no sounder basis for choice. Trying out the hypothesis permits verification and change; if the hypothesis was wrong, little has been lost. We may say, in sum, that the fallibility of wideband procedures does no harm unless the hypotheses and suggestions they offer are regarded as verified conclusions about the individual. Of course some degree of skepticism is required in interpreting any psychological test, however precise and [illegible] it may be.

Impressionistic procedures, and psychometric procedures in clinical settings, are chiefly used for hypothesis formation. Clinicians bring a Rorschach interpretation or a Wechsler IQ to a case conference, where it is considered along with other data, and this conference concludes that it is better to try one therapy than another. Only where the decision [illegible] as when surgery is prescribed or where the patient once classified is forever left in [illegible] tried. Unfortunately, assessors (and psychometric testers) have far too often claimed that their methods give valid final conclusions. This has two bad consequences: nonpsychologists expect more than the tests can deliver, and the psychologist tries to live up to his claim by giving a single description or recommendation instead of outlining the reasonable alternatives.


12. Defend the analogy of a psychological examination to a communication system.

13. Would you characterize the DAT as wideband or narrowband? the MMPI? the test of flicker-fusion frequency?

14. Did the Hartshorne-May tests of honesty serve better as a wideband or narrowband procedure?

15. Why is sequential testing of hypotheses more important for wideband than for narrowband procedures?


Adaptation to the Individual. Closely related to the foregoing comments are the advantages of assessment procedures for shaping the testing to the individual. The psychometric tester standardizes his test to answer questions presumed to be important for everyone. The impressionistic tester may choose the problems and topics covered by the testing to fit the individual. The psychometric tester tries to standardize every aspect of his procedure, so that precisely the same information is obtained about each subject. The impressionistic tester wishes to obtain whatever information is most significant regarding a particular individual, even if this means asking different questions of each person. The flexibly administered interview, the individualized Rep test, and the unstructured projective technique elicit idiosyncratic, personally significant responses for which there is no counterpart in psychometric methodology. These responses can only be interpreted impressionistically.

Meehl (1954) gives several examples of such interpretations, which he properly regards as the essence of the clinical art. One is from the psychoanalyst Reik (1948, p. 263):

Our session at this time took the following course. After a few sentences about the uneventful day, the patient fell into a long silence. She assured me that nothing was in her thoughts. Silence from me. After many minutes she complained about a toothache. She told me that she had been to the dentist yesterday. He had given her an injection and then had pulled a wisdom tooth. The spot was hurting again. New and longer silence. She pointed to my bookcase in the corner and said, "There's a book standing on its head." Without the slightest hesitation and in a reproachful voice I said, "But why did you not tell me that you had had an abortion?"

How Reik made this correct inference from the patient's chain of associations and silences is not our concern here. The skill is compounded of theory, imagination, experience, and willingness to make (and verify or discard) rash guesses. The important point is that this interpretation, which might accelerate appreciably the therapy, could not possibly have been reached by a formal testing procedure. In the first place, such a procedure would be unlikely to touch upon the particular topic of abortion. Even if it did, there is no "trait" on which the response could be scored, unless one envisions keying the MMPI to distinguish ex-abortion patients from other women, and similarly for every other group having conceivable clinical interest. Secondly, the response cannot possibly be interpreted by any multiple-regression or other routine procedure. How could one establish frequency tables to give the meaning of [illegible]-observation-about-an-inverted-book? This is a unique datum to be interpreted only by a creative act of applying such a theory as the psychoanalytic hypothesis that tooth extraction is a disguised symbol for birth. This extreme example of symbolic content shows clinical idiographic interpretation in its purest form, but unique content is interpreted by every assessor.

The interpreter must likewise deal with the unprecedented when he predicts response to a specific situation. "Should this child be left with his mother or placed in a foster home?" is a decision in which statistics cannot aid. No experience table can predict from IQ, anxiety level, or anything else whether he will adjust well to his mother. This can be estimated only from her particular character, the child's character, and the precise home situation. Any decision about this problem is likely to be wrong, but that is beside the point. The decision must be made, and insofar as psychological study of the child can improve the decision, the risk of error is reduced. In this case impressionistic appraisal is the best available basis for decision. All the "little" decisions that take place from minute to minute in therapy and teaching are similarly resistant to measurement and statistics. In these judgments where the psychometric tester would have nothing to say, the hints from the TAT or a case history may provide valuable guidance (Meehl, 1954, p. 120).
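The limitation of the experience table can be seen from the way such a table is built. The sketch below is a hypothetical illustration: the category labels, the counts, the function names, and the rule that at least twenty comparable past cases are required are all assumptions, not taken from any study cited here. A table of this kind can report a success rate only for classes of cases that have recurred; for a unique configuration it has nothing to report.

```python
from collections import defaultdict


def build_experience_table(past_cases):
    """Map each predictor category to (number of cases, proportion succeeding)."""
    counts = defaultdict(lambda: [0, 0])  # category -> [cases, successes]
    for category, succeeded in past_cases:
        counts[category][0] += 1
        counts[category][1] += int(succeeded)
    return {cat: (n, s / n) for cat, (n, s) in counts.items()}


def expected_success(table, category, min_cases=20):
    """Return the tabled success rate, or None when too few comparable cases exist."""
    n, rate = table.get(category, (0, None))
    return rate if n >= min_cases else None


if __name__ == "__main__":
    # Hypothetical follow-up records: (score category, whether the person succeeded).
    past = ([("high", True)] * 30 + [("high", False)] * 10
            + [("low", True)] * 5 + [("low", False)] * 35)
    table = build_experience_table(past)
    print(expected_success(table, "high"))
    print(expected_success(table, "this child, this mother, this home"))
```

With the hypothetical records shown, the first query yields 0.75 and the second yields no answer at all, which is the position of the psychometric tester facing an unprecedented decision.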
The difference between psychometric and impressionistic assessment, we find, is not that one uses multiple-choice questions and one uses [illegible], or that one is compulsively cautious, the other erratically overambitious. The two approaches to observation and interpretation are suited to different purposes. When clinical testers answer questions for which their methods and theory are badly suited, their answers are next to worthless or at best costly beyond their value. When psychometric testers are faced with a clinical problem calling for understanding rather than simple evaluation (e.g., what lies behind a given child's anxious withdrawal?) they are unable to give any answer at all. Each in his own proper province will surpass the other and each outside his province is nearly impotent. Assessment methods have earned a bad name for themselves by trying to compete with measurement techniques on their own ground. In the absence of excellent research to guide the combination of information, the wideband technique should not be advanced as a means of predicting specific, recurring outcomes. The precisely focused instrument, on the other hand, should not be exalted into the sole approved technique for gathering information. It is efficient only when the decision maker asks the particular question for which it has been designed and validated. Even the TMC must be interpreted impressionistically when one wants to explain a low score, or to predict performance in a new training program.


We expect an evolution from naturalistic to highly structured techniques. Alfred Binet began to explore and define intelligence by means of impressionistic interpretations of imaginative performance. It was only after this study had disclosed the important variables that he began to design the structured tests from which all later ability tests sprang. The personality questionnaire developed out of psychiatric observations of symptoms, and the interest questionnaire out of counseling interviews. Pure naturalistic observation is always the first step in science, followed by gradual structuring of the observations, and ultimately by definition of specific variables and quantitative measurement. Whenever the importance of some [illegible] becomes great enough to warrant quantitative measurement, [illegible] procedures can be developed to measure them. The psychometric method can give increasingly more refined and trustworthy answers to any recurrent question than can an exclusively natural-history observational approach, because successive stages of research eliminate sources of error.

This does not mean that impressionistic methods will or should ultimately disappear. There will always be unique problems to deal with and unique [illegible]. Since every person is unique in certain ways, each will present some problems which are beyond the reach of standard interpretative formulas. Moreover, treatments will continually change, and judgments about assignment to the new treatments will have to be made without waiting for years of follow-up research. Even the growth of psychometric testing creates a demand for greater and more skillful use of wideband techniques. As more and more specialized scales are developed for measuring aptitudes, traits, and situational meanings, it will become even more important to have suitable wideband procedures for use as a first stage to determine which of these psychometric scales are relevant to each person. Psychometric and impressionistic testing procedures will always be needed to supplement each other.

Here again we find illustration of the principle introduced very early in this book: one cannot identify one group of tests as good and recommend it for use. For every type of decision and for every type of psychological information, there are many techniques and many specific instruments. The instruments differ in practicality, in the degree of training required to use them, in the variety of information they obtain, and in fidelity. The instrument that works best for one tester will not be best for another tester making the same decision. Tests must be chosen by a highly qualified professional worker who has a thorough understanding of the institution and persons he serves.

All in all, psychological testing is [illegible] well boast of. Errors of measurement have been reduced year by year, and the significance of tests has been increased, until [illegible] American society feel the impact [illegible] marriage, governmental policy, and character-building have been aided by tests. Interpretations of test data are [illegible] lives by guiding a man into a suitable lifework, by placing a [illegible] under therapy which will avert mental disorder, or by detecting causes of a failure in school which could turn a child into a beaten individual. Methods are now available which, if used carefully by responsible interpreters, can unearth the talents in the population and identify personality aberrations which would cause those talents to be wasted. Building on these techniques, we are in a position to capitalize as never before on the richness of human resources.

Cooperative Achievement Tests
Cooperative School and College Ability 
Tests (SCAT), 141, 230, 286 
Coórdination, psychomotor, 303-5 
See also Complex Coórdinatio: 
Correlation, 110-115 
computing guide, 111, 12 
multiple, 339 ff. 
Costs of testing, 146-147, 309 
Counseling, client-centered, 293-296 
prescriptive, 297-299 
Crawford Structural Visualization Test, 
94 
Criterion, 103, 108, 329-331 
Critical incident technique, 324 


298-297 
592-594 
6, 579-581 


37, 57, 173, 383- 


30, 254, 304- 


n Test 


4
Critical score, 334-338, 342 ff.

Crossvalidation, 355 

Culture, effect on test score, 182-185, 203- 
204, 217, 237—243 

Culture-Free Intelligence Tests, 230 

Curriculum and achievement tests, 362- 
368, 396—400 

Cutting score, 334—338, 342 ff, 


Davis-Eells Games, 230, 240-242 
Day record, 531 
Decisions, individual, 284 
institutional, 284, 324-358 
types of, 17-20 
Delinquents, test performance, 188, 205, 
481 
Dentistry, prediction of success, 280, 306 
Detroit First-Grade Intelligence Test, 223 
Deviation IQ, 171 
Dexterity, 73, 303, 307, 341 
Diagnosis, educational, 390-392
psychiatric, 484—485 
Diagnostic Reading Tests, 390 
Difference, reliability of, 287 
Differential abilities, 269—292 
Differential Aptitude Tests ( DAT), 40, 71, 
88-89, 92, 269-271, 275 ff. 
case record, 86, 290, 291 
validity data, 118, 278, 384 
Differential validity, 357 
Difficulty, 134-135 
Directions for tests, 37-49 
Distribution, 76 ff., 135 
normal, 83-85, 185 
smoothed, 82 
Divergent phenomenon, 599 
Division on Child Development, 521, 632 
Dominance, tests of, 468 
Draw-a-Man Test, 207 
Durrell Analysis of Reading Difficulty, 390 
Dynamic interpretation, 455-456, 579-581,
592 ff. 


Educational success, prediction of, 173, 
242, 277, 320 
See also College success 
Edwards Personal Preference Schedule, 
450—451, 487, 496 


Einstellung test, 547 f. 
Embedded Figures test (EFT), 547—558 


effect on test performance, 54 ff, 


Empathy, 20
Empirical keying, 328, 406-408, 427, 456-
459,
Empirical method, 103
Engineering, selection and guidance, 321,
E'354-331, 345, 407, 420-423
Equal-appearing units, 71, 385-387
Equipercentile method, 92
Equivalence, coefficient of, 137, 140-142
Equivalent forms, 145 
Error of measurement, 126 ff., 288 
Essay test, scoring of, 65 

See also Recall vs. recognition 
Essential High School Content Battery, 

383 

Ethics of testing, 11-13, 459—462 
Evaluation and Adjustment Series, 383 
Evaluation form for tests, 147—152 
Evaluation of treatments, 19 
Examiner, effect of, 60-64, 596 

See also Tester
Expectancy table, 72, 387 
Extrinsic validity, 58 


F scale, California, 446 
Facade, 446-451, 453 
Face validity, 143 
Factor analysis, 247—268 
Factors, ability, 256-264 
interest, 410, 436 
personality, 467 
psychomotor, 307—308 
Faking, 446—449, 458, 513 
False positive, 334, 478 
Fatigue, 43 
Feeble-mindedness, 169, 173, 205 
Fels Parent Behavior Rating Scales, 526-
Fidelity, 602 
Field observations, 38, 440-441, 528-538 
Flanagan Aptitude Classification Tests, 292 
Flicker fusion, 546, 552, 557 
Forced choice, 450—452, 512-516 
Foreign language aptitude, 320 
Forms, comparable, 145 
Four-Picture Test, 574 
Free responses, scoring of, 65 
French Test of Insight, 573 
Frequency distribution, 76 ff., 135 
Frustration, 33, 540-541 


General Aptitude Test Battery (GATB),
82, 272-276, 280 ff., 342
General educational development, tests of,
General factor, 215, 250, 258 


General mental ability, 30, 164, 243-246 
as factor, 215 
group tests, 214-243 
summary list, 228-233
historical background, 157—163 
individual tests, 157-208 
summary list, 206-208 
overlap with achievement, 223-225
predictive validity, 176-181 
preschool tests, 183, 208-212 
spectrum of test types, 235-237 
Generosity error, 506 


Gesell Developmental Schedules, 212 
Gifted persons, 173-174 
Gordon Personal Inventory, 496 
Gordon Personal Profile, 116, 496 
Grade average, prediction of, 72-78, 116 
Grade norms, 385-387 
Graves Test of Design Judgment, 318 
Griffiths Mental Development Scale, 212
Group behavior, observations of, 566-56
Group factor, 250
Group test, 35
Guess Who test, 519
Guessing, 49
Guidance, differential ability tests, 269-291
interest tests, 431-436
See also Counseling

Guilford-Shneidman-Zimmerman Interest
Survey, 437
Guilford-Zimmerman Aptitude Survey, 292
Guilford-Zimmerman Temperament Survey, 496, 584


Halo effect, 508
Haggerty-Olson-Wickman Behavior Rating
Scale, 511
Hand-Tool Dexterity Test, 306
Handwriting, 65-66
Hanfmann-Kasanin test, 559-560
Henmon-Nelson Tests of Mental Ability,
220, 221, 222, 230 481 
Heston Personal Adjustment cor 49 
High school testing program, > ' a96, 
Historical eee Oe 157-163, 394-39 
464—469, 581-582 " 292 
Holzinger-Crowder Uni-Factor TOST 828, 
Homogeneity of test items, 215, 
M mes Equivalence 556 
See also Equiva 
Honesty, tests oe 549-544, 552, 554, 
Horn Art Aptitude Inventory, 81 T 
House-Tree-Person (HTP) Test, Scale, 
Humm-Wadsworth Temperament 


468 94 
Hypotheses, verification of, 20, 121, 4 


Idiographic analysis, 499-504 
LE.R. bere Law X 
Illinois Art Ability Test, 348, 
Impressionistic testing, 24-28, 63, 346 

564, 579-607 
Incentives and test performance, 
Indians, test performance of, 1 
Individual decisions, 284 
Individual test, 35 
Industrial applications of test: 

306, 342, 393, 460 — 9 
Industrial arts, prediction of suci 

306, 340 
Infant development, tests of, 208-212 
Information theory, 602 


52-58 


s, 9, 118. 228, 


Institute of Personality Assessment and Re- 
search, 587 
Institutional decisions, 284, 324—358 
Intelligence, 160, 164, 244-246 
social, 319 
See also General mental ability
Intelligence quotient (IQ), 102, 170 ff.
distribution, 171-173 
interpretation of, 173-174 
stability, 176-179 
as standard score, 171 
Interaction recorder, 536 
Interest inventories, 405—439 
interpretation, 428-434
stability of scores, 418-419 
summary list, 437 
Internal consistency, 141-142
Interpretation, dynamic, 455-456, 579-581,
592 ff.
to subject, 431-434, 487 
Intervals, equal, 71, 385-387 
Intrinsic validity, 58 
Introversion, tests of, 466—468 
Inventory, see Personality; Interest 
Iowa Tests of Basic Skills, 383 
Iowa Tests of Educational Development 
(ITED), 383 
Item form, 371 
Items, selection of, 364-367, 406-408 


Job analysis, 325-327
Job performance, prediction of, 116, 217,
225-228, 279-280, 281, 306, 312-314,
342, 485
Job replica, 304-306, 215-514
Job satisfaction, 42
Judgment, errors of, 346-348, 506-510


Kohs Block Design Test, 41-42, 82, 558-
560
Kuder Preference Record, 412-417 ff, 437, 
448, 450 


Kuder Preference Record—Personal, 496 


Kuder-Richardson formulas, 141 
Kuhlmann-Anderson Test, 218-224, 230 


Language, foreign, aptitude for, 320
Layman, appeal to, 142
Leaderless Group Discussion (LGD), 566-
Leadership, assessment of, 118, 516, 520,
566-568, 582-589
Lee-Thorpe Occupational Interest Inventory, 438
Leiter International Performance Scale, 207
Length of test, 130-132
Lewerenz Test of Fundamental Abilities in
Visual Art, 31
Lorge-Thorndike Intelligence Tests, 230


Machine, test-scoring, 67-69
Make A Picture Story Test (MAPS), 574
Manifest Anxiety Scale, Taylor, 451, 469,
477, 495
Manual, 100, 144 
Manual dexterity, see Dexterity 
mu — qe prediction of success, 278- 
27 
Maximum performance, 29, 370 
Maze test, 29, 55 
Mean, 78-79 
Mechanical comprehension, 251, 252, 
281 ff., 341 
See also Bennett Test of Mechanical 
Comprehension 
Median, 75 
Medicine, selection and guidance, 338, 352, 
429, 486 f 
Meier Art Judgment Test, 317 
Memory factor, 256 
Mental ability, see General Mental ability; 
Special ability 
Mental age, 168 ff. 
Mental deficiency, 169, 173, 205 
Mental Measurements Yearbook, 101 
Merrill-Palmer Scale, 207 
Metal Filing Worksample, 306 
Metropolitan Achievement Tests, 384 
Miller Analogies Test, 231, 584 
Minnesota Clerical Aptitude Test, 306 
Minnesota Counseling Inventory (MCI), 
491, 496 
Minnesota Multiphasic Personality Inven- 
tory (MMPI), 458, 468, 469-485, 584 
case interpretation, 472, 491 
Minnesota Paper Form Board, 306, 340 
Minnesota Preschool Scale, 183, 207 
Minnesota Rate of Manipulation Test, 306, 
361 
Minnesota Spatial Relations Formboard, 
273, 306, 340 
Minnesota Vocational Interest Inventory, 
438 
Modern Language Aptitude Test, 321
Mooney Problem Check Lists, 487, 497 
Motion-picture as testing medium, 393 
Motivation of persons tested, 52-64, 441, 
449, 549, 574 
Multiple Aptitude Tests, 292 
Multiple correlation, 339 f. 
Multiple cutoff, 342 ff. 
Myers-Briggs Inventory, 469 
Matrix test, 215 ff. 


Navy personnel, prediction of success, 258,
346, 361, 371, 480
Need for achievement, 572-574 
Neurotic states, (8485, 562 
Nomination technique, 519 
See also Peer rating 
Nonverbal tests, 220
Normal probability curve, 83-85 
Normalized score, 83-85 
Norms, 77, 87-94, 102, 221, 385-388 
expectancy, 72, 387 
grade, 385-387 
profile, 285 
Number factor, 256 
Numerical Operations test, 254, 266, 341 


Objective test of personality, 443 
See also Performance test of person- 
ality 

Objectives of instruction, 368-378, 381 
Objectivity, 23, 65 
Observation, 32, 440-441
during tests of ability, 187-188, 191
in field situation, 528-538
as proficiency measure, 393
in standard situation, 442-443, 539-560
Observer error, 393, 506-510, 533-535 
Occupational Interest Inventory, 438 
Odd-even reliability, see Split-half 
Office workers, see Clerical workers 
Ohio State University Psychological Examination,
72-73, 231
Operational Stress technique, 540, 557 
Organic brain damage, 206, 556, 560, 565 
OSS assessment, 567-568, 581-582
Otis tests of general ability, 220, 221, 222, 
231, 306, 361 


Painting, as projective technique, 541 
Parallel forms, 145 


Parent Behavior Rating Scales, 526-528 
Patients,
rating scale, 524-525
test performance,
472-485, 562
test procedure, 69
Pattern interpretation, see Configural scoring,
Profile
Peer rating, 442, 518-523, 583 
Pencil-paper test, psychomotor, 309 
Percentile scale, 74-78, 87 
computing guide, 74-75 
Perception, tests of, personali 
pw he p lity, 544-558, 
Perceptual style, 544-546, 579-581 
Performance tests, 35 
of mental ability, 192 ff., 202-206 
of personality, 32, 442-443, 539-576
of proficiency, 354 
Persistence, tests of, 544, 545, 555-556 
Personality, 
in ability tests, 189-191, 200-202, 205 
in interest tests, 428-43] 3 
trait approach to, 466-468, 499-501


aetunioal behavior $1.90 AAA ane 


187-188, 200-202, 


Personality measures, 31-34, 440-607
general principles, 440-462
performance tests, 539-576
predictive validity, 485-487, 542
projective, 560-566, 569-576, 594
ratings, 506-528
self-report, 464-504
stability, 488-489
See also Assessment 
Personality Record, 507 
Personality structure, 32, 500 ff. 
Phenomenological psychology, 464 
Pictorial items, 371 
Picture Frustration (PF) Test, 575 
Pilot success, see Air Force 
Pintner General Ability Tests, 140, 231 
Pintner-Paterson Scale of Performance 
Tests, 208 
Pitch discrimination, 133-134, 344, 372 
Porteus Maze Test, 29, 64 
Power test, 222 
Practice, effects of, 3 Sa 12 
Prediction, 17, 325-3: z ; 
See also Assessment; Engineering, 
selection and guidance; etc. 
Predictive validity, 103-119 
Pre-Engineering Inventory, 322 
Preschool tests, 183, 208-212 
Prescriptive counseling, 297 
Primary mental abilities, 256-258
Primary Mental Abilities (PMA), 
142, 256-258, 292 
Problem solving, 7, i id 544—560 
Process vs. product, 2! 5, 
Sedona correlation, 112-11 
124 
Product rating, 392 
Product vs. prones a 
Products, rating of, 
Professional aptitude tests, 320-322
Proficiency test, 31, 360-400
Profile, 86 413- 
interpretation, 200-202, 284-291, 
476, 481—482 
reliability, 275, 287 
Program tests, 99 
pro mee Matrices Test, cow 
Projective techniques, 26, 44 9. 59 
565, 569-576, 579-586, 58 
summary list, 574-575 


Psychiatric rating scales, 524
Psychiatrists, prediction of success, 585-586
Psychologists, selection and guidance, 425-
426, 431, 583-585
Psychometric testing, 24-28
See also Clinical vs. statistical interpretation
Psychomotor abilities, 301-314
Psychomotor factors, 307-308


Publishers, 98 

list of, 609 
Purdue Pegboard, 77, 309 
Pursuit Confusion Test, 305 
Pursuit tests, 303-306 


Q-sort, 514-516, 593 
Questionnaires, 34, 405 ff. 


Racial differences in test score, 204 
Range, effect on reliability, 133 
effect on validity, 351 
Rank correlation, 110-111
Rapport, 38, 60-64, 167, 449 
Rating, 506-528 
of products, 392 
Rating scales, 507, 511, 517, 523-528 
Raven Progressive Matrices Test, 215-218 
Raw score, 69-71 
Reaction time, 301, 307 
Reading tests, 388-392 
Reading, relation to ability test perform- 
ance, 220 
Recognition vs. recall, 373 
Records, anecdotal, 536-538 
Refusal of tests, 167 
Regents examinations, 396-397 
Reliability, 126-142 
of a difference, 275, 287 
and test length, 130-132
types of, 136-142 
Research use of tests, 7, 20, 494 
Response set, see Response style
Response style, 50, 872, $ 
Retest correlation, 134, 190; 
Reviews, test, 101g 555 - 
Rigidity, tests of, > 
Rod and Frame test, 548, 553, 555, 558
Role Concept Repertory 
Rorschach method, 
Rotary Pursuit Test, 305, 
Salesmen, success » 495, 485-487 
Sample vs. sign, 4 
Sampling of content 364-366 
Sampling error, 1
Sameie, for observation, 440, 529-530, 
540 
Scales, see Rating scales; 
Scatter, 186 
Scatter diagram, 112-114, 124, 334
Schemata, 245 
Schizophrenics, test pe 
203, 491, 558, 560 si 
Scholastic aptitude, see General men 
ability; Grade average; prediction à 
Scholastic Aptitude Test (SAT), 87, 5% 


232 
School, achievement, tests of, 360-400 


prediction, 180-181, 205 


Handwriting 


formance, 187, 201, 
School (Continued)
testing programs, 394-400
See also College success; Educa- 
on bg mie 
chool and College Ability Tes 7 
140, 230, 236 aii 
Science, selection and guidance, 431 
Score, corrected for guessing, 49 
raw, 69-71 
standard, 80-85 
Scorer reliability, 65 
Scores, combination of, 339—348 
Scoring, 65-69 
Screening, on adjustment, 466, 478—484 
Selection, 18, 325-356 
See also Assessment; Architecture; 
Engineering; etc.
Selection ratio, 350 
Selective Service College 
Test, 89 
Self-concept, 294—296, 454—455, 464 
Self-report, 84, 442, 464—504 
as self-description, 444—459, 489—490 
Semantic Differential, 501—504 
Semantic Test of Intelligence, 232 
Sentence Completion Technique, 575
aste e. 146, 346, 603 
Sequential Tests of Educational Progress
(STEP), 384
Short Employment Tests, 116, 140 
Shopwork, prediction of success, 278, 306, 
340


Qualification 


roficiency, 392 

Shrinkage, 355 
Sign vs. sample, 457 
Simple structure, 255
Situation, effect on response, 443, 485, 529-530,
532
Situation test, 443 
16 P. F. Test, 497 
Skewed distribution, 84, 135 
Skinner box, 69 
Social class, and ability scores, 237—243 

and interest scores, 423-425
and motivation, 239-240 
Social desirability, see Facade 
Social intelligence, 189, 319 
Sociogram, 521 
Sociometric rating, 519-522 
See also Peer ratings 
Sophistication of persons tested, 58 
Spatial ability, 256, 276-281 
Spearman-Brown formula, 131, 141 
Special abilities, 30 
See also Clerical; Number 

Specific factor, 250, 311-312
Spectrum of general ability tests, 235-237 
Speed, psychomotor, 303, 306, 307 
Speeding, degree of, 221-223, 306 


Split-half method, 141 
SRA Achievement Series, 384
SRA Youth Inventory, 497 
Stability, coefficient of, 136-137, 139-140 
Standard deviation, 78-80 
Standard error, 126-127 
Standard score, 80-87, 102
Standardization, 22, 59 
Standardized proficiency test, 394-400
Stanford Achievement Test, 384, 386 
Stanford-Binet Scale, 47, 66, 161, 163-189
evaluation, 189 
Stanine, 82 
Steadiness, 302, 307, 309 
Stenographic aptitude, 332—333 
Stenography, selection and guidance, 332- 
333 
Store, department, personnel testing, 5, 
360-361 
Strategy, 334 ff. 
Stress situation, 540-541, 557, 567-568 
Strong Vocational Interest Blank (SVIB), 
406-412, 416-435, 438, 448, 489, 584
Stroop Color Word test, 547, 557 
Structured and unstructured situations, 540 
Studiousness keys, 498 
Study habits, 497 
Study of Values, 35, 140, 489 
Stylistic tests, 569 
Subtle items, 458, 484 


Survey of Study Habits and Attitudes, 497 
Szondi Test, 575 


T-score, 80-85 
Taxonomy of educational outcomes, 374-380
Taylor Manifest Anxiety Scale, 451, 469,
477, 495

Technical Recommendations, 34, 82, 90, 
101-102 


Terman-McNemar Test of Mental Ability, 
220, 221, 233 
Test, choice of, 96-100, 142-158, 325-358,
408
definition, 21 
form for examining, 147-152 
free response, 26 


of Mechanical Comprehension, see Mechanical comprehension
multiscore, 145, 602 
objective, 23 
recognition, 26 
situation, 443 
standardized, 22, 394-400
Tester, interaction with Subject, 60-64 
qualifications of, 10, 167, 497-499
Testing, costs of, 146-147, 309
Tests, administration, 37-63,
497-499
catalogs of, 14-15 


Tests (Continued)
classification of, 29 
distribution of, 9 f., 98
of Primary Mental Abilities, 142, 292 
sources of information, 14 - 
Thematic Apperception Test, 3, 569-572,
584, 593
Thematic tests, 569 ff., 580 
Tilting Room, 548 


Time limits, 47, 145, 221-223 
Time sampling, 530 
Trade test, 282 
Trait approach to personality, 466-468, 
499-501 
Treatments, evaluation of, 19 
True score, 129 ff. 
Two-Hand Coordination Test, 305
Typical performance, tests of, 31, 370,
403 ff.


Unique factor, 250
Unstructured and structured situations, 540
Utility, as function of validity, 348-35


Valentine Intelligence Tests for Children,
208
Validation, 27, 96-124 
Validity, 96-124 
concurrent, 104-119
construct, 104-106
content, 104, 106, 364-366
curricular, 397 
differential, 357 
predictive, 103-119 
relation to reliability, 132 
types of, 103-107 
Validity coefficients, 110, 115-116 
acceptable, 348-358 
Validity generalization, 355 
Variance, 80
decomposition of, 130, 138, 224
sources of, 128
See also Factor analysis
v:ed factor, 252, 253, 260
Verbal factor, 256
Vocational Interest Analyses, 438


War Office selection boards, 143, 581,
582
Water Jar Test, 547, 553, 554
Wechsler Adult Intelligence Scale
(WAIS), 41, 54, 82, 192-202, 248
Wechsler-Bellevue Scale (WB), 1
264-267
Wechsler Intelligence Scale for Children
(WISC), 192-202, 217
Wittenborn Psychiatric Rating Scales, 524
Woodworth Personal Data Sheet, 465
Worksample, 304-306, 312-314, 321, 393


z-score, 81
Zero, absolute, 72


